author | stanton <stanton> | 1999-04-16 00:46:29 (GMT) |
---|---|---|
committer | stanton <stanton> | 1999-04-16 00:46:29 (GMT) |
commit | 97464e6cba8eb0008cf2727c15718671992b913f (patch) | |
tree | ce9959f2747257d98d52ec8d18bf3b0de99b9535 /tools/encoding/cjk.inf | |
parent | a8c96ddb94d1483a9de5e340b740cb74ef6cafa7 (diff) | |
merged tcl 8.1 branch back into the main trunk
Diffstat (limited to 'tools/encoding/cjk.inf')
-rw-r--r-- | tools/encoding/cjk.inf | 4467 |
1 files changed, 4467 insertions, 0 deletions
diff --git a/tools/encoding/cjk.inf b/tools/encoding/cjk.inf new file mode 100644 index 0000000..9fbe527 --- /dev/null +++ b/tools/encoding/cjk.inf @@ -0,0 +1,4467 @@ +--- BEGIN (CJK.INF VERSION 2.1 07/12/96) 185553 BYTES --- +CJK.INF Version 2.1 (July 12, 1996) + +Copyright (C) 1995-1996 Ken Lunde. All Rights Reserved. + +CJK is a registered trademark and service mark of The Research + Libraries Group, Inc. + +Online Companion to "Understanding Japanese Information Processing" +- ENGLISH: 1993, O'Reilly & Associates, Inc., ISBN 1-56592-043-0 +- JAPANESE: 1995, SOFTBANK Corporation, ISBN 4-89052-708-7 + + + This online document provides information on CJK (that is, +Chinese, Japanese, and Korean) character set standards and encoding +systems. In short, it provides detailed information on how CJK text is +handled electronically. I am happy to share this information with +others, and I would appreciate any comments/feedback on its content. +The current version (master copy) of this document is maintained at: + + ftp://ftp.ora.com/pub/examples/nutshell/ujip/doc/cjk.inf + +This file may also be obtained by contacting me directly using one of +the e-mail addresses listed in the CONTACT INFORMATION section. + + +TABLE OF CONTENTS + + VERSION HISTORY + RESTRICTIONS + CONTACT INFORMATION + WHAT HAPPENED TO JAPAN.INF? + DISCLAIMER + CONVENTIONS + INTRODUCTION + PART 1: WHAT'S UP WITH UJIP? 
+ PART 2: CJK CHARACTER SET STANDARDS + 2.1: JAPANESE + 2.1.1: JIS X 0201-1976 + 2.1.2: JIS X 0208-1990 + 2.1.3: JIS X 0212-1990 + 2.1.4: JIS X 0221-1995 + 2.1.5: JIS X 0213-199X + 2.1.6: OBSOLETE STANDARDS + 2.2: CHINESE (PRC) + 2.2.1: GB 1988-89 + 2.2.2: GB 2312-80 + 2.2.3: GB 6345.1-86 + 2.2.4: GB 7589-87 + 2.2.5: GB 7590-87 + 2.2.6: GB 8565.2-88 + 2.2.7: GB/T 12345-90 + 2.2.8: GB/T 13131-9X + 2.2.9: GB/T 13132-9X + 2.2.10: GB 13000.1-93 + 2.2.11: ISO-IR-165:1992 + 2.2.12: OBSOLETE STANDARDS + 2.3: CHINESE (TAIWAN) + 2.3.1: BIG FIVE + 2.3.2: CNS 11643-1992 + 2.3.3: CNS 5205 + 2.3.4: OBSOLETE STANDARDS + 2.4: KOREAN + 2.4.1: KS C 5636-1993 + 2.4.2: KS C 5601-1992 + 2.4.3: KS C 5657-1991 + 2.4.4: GB 12052-89 + 2.4.5: KS C 5700-1995 + 2.4.6: OBSOLETE STANDARDS + 2.5: CJK + 2.5.1: ISO 10646-1:1993 + 2.5.2: CCCII + 2.5.3: ANSI Z39.64-1989 + 2.6: OTHER + 2.6.1: GB 8045-87 + 2.6.2: TCVN-5773:1993 + PART 3: CJK ENCODING SYSTEMS + 3.1: 7-BIT ISO 2022 ENCODING + 3.1.1: CODE SPACE + 3.1.2: ISO-REGISTERED ESCAPE SEQUENCES + 3.1.3: ISO-2022-JP AND ISO-2022-JP-2 + 3.1.4: ISO-2022-KR + 3.1.5: ISO-2022-CN AND ISO-2022-CN-EXT + 3.2: EUC ENCODING + 3.2.1: JAPANESE REPRESENTATION + 3.2.2: CHINESE (PRC) REPRESENTATION + 3.2.3: CHINESE (TAIWAN) REPRESENTATION + 3.2.4: KOREAN REPRESENTATION + 3.3: LOCALE-SPECIFIC ENCODINGS + 3.3.1: SHIFT-JIS + 3.3.2: HZ (HZ-GB-2312) + 3.3.3: zW + 3.3.4: BIG FIVE + 3.3.5: JOHAB + 3.3.6: N-BYTE HANGUL + 3.3.7: UCS-2 + 3.3.8: UCS-4 + 3.3.9: UTF-7 + 3.3.10: UTF-8 + 3.3.11: UTF-16 + 3.3.12: ANSI Z39.64-1989 + 3.3.13: BASE64 + 3.3.14: IBM DBCS-HOST + 3.3.15: IBM DBCS-PC + 3.3.16: IBM DBCS-/TBCS-EUC + 3.3.17: UNIFIED HANGUL CODE + 3.3.18: TRON CODE + 3.3.19: GBK + 3.4: CJK CODE PAGES + PART 4: CJK CHARACTER SET COMPATIBILITY ISSUES + 4.1: JAPANESE + 4.2: CHINESE (PRC) + 4.3: CHINESE (TAIWAN) + 4.4: KOREAN + 4.5: ISO 10646-1:1993 + 4.6: UNICODE + 4.7: CODE CONVERSION TIPS + PART 5: CJK-CAPABLE OPERATING SYSTEMS + 5.1: MS-DOS + 5.2: WINDOWS + 5.3: MACINTOSH + 
5.4: UNIX AND X WINDOWS + 5.5: OTHERS + PART 6: CJK TEXT AND INTERNET SERVICES + 6.1: ELECTRONIC MAIL + 6.2: USENET NEWS + 6.3: GOPHER + 6.4: WORLD-WIDE WEB + 6.5: FILE TRANSFER TIPS + PART 7: CJK TEXT HANDLING SOFTWARE + 7.1: MULE + 7.2: CNPRINT + 7.3: MASS + 7.4: ADOBE TYPE MANAGER (ATM) + 7.5: MACINTOSH SOFTWARE + 7.6: MACBLUE TELNET + 7.7: CXTERM + 7.8: UW-DBM + 7.9: POSTSCRIPT + 7.10: NJWIN + PART 8: CJK PROGRAMMING ISSUES + 8.1: C AND C++ + 8.2: PERL + 8.3: JAVA + A FINAL NOTE + ACKNOWLEDGMENTS + APPENDIX A: INFORMATION SOURCES + A.1: USENET NEWSGROUPS AND MAILING LISTS + A.1.1: USENET NEWSGROUPS + A.1.2: MAILING LISTS + A.2: INTERNET RESOURCES + A.2.1: USEFUL FTP SITES + A.2.2: USEFUL TELNET SITES + A.2.3: USEFUL GOPHER SITES + A.2.4: USEFUL WWW SITES + A.2.5: USEFUL MAIL SERVERS + A.3: OTHER RESOURCES + A.3.1: BOOKS + A.3.2: MAGAZINES + A.3.3: JOURNALS + A.3.4: RFCs + A.3.5: FAQs + + +VERSION HISTORY + + The following is a complete listing of the earlier versions of +this document along with their release dates and sizes (in bytes): + + Document Version Release Date Size + ^^^^^^^^ ^^^^^^^ ^^^^^^^^^^^^ ^^^^ + JAPAN.INF 1.0 Unknown Unknown + JAPAN.INF 1.1 08/19/91 101,784 + JAPAN.INF 1.2 03/20/92 166,929 (JIS) or 165,639 (Shift-JIS/EUC) + CJK.INF 1.0 06/09/95 103,985 + CJK.INF 1.1 06/12/95 112,771 + CJK.INF 1.2 06/14/95 125,275 + CJK.INF 1.3 06/16/95 130,069 + CJK.INF 1.4 06/19/95 142,543 + CJK.INF 1.5 06/22/95 146,064 + CJK.INF 1.6 06/29/95 150,882 + CJK.INF 1.7 08/15/95 153,772 + CJK.INF 1.8 09/11/95 157,295 + CJK.INF 1.9 12/18/95 170,698 + CJK.INF 2.0 03/12/96 175,973 + +With the release of this version, all of the above are now considered +obsolete. Also, note the three-year gap between the last installment +of JAPAN.INF and the first installment of CJK.INF -- I was writing +UJIP and my PhD dissertation during those three years. Ah, so much for +excuses... 
+ + +RESTRICTIONS + + This document is provided free-of-charge to *anyone*, but no +person or company is permitted to modify, sell, or otherwise +distribute it for profit or other purposes. This document may be +bundled with commercial products only with the prior consent from the +author, and provided that it is not modified in any way whatsoever. +The point here is that I worked long and hard on this document so that +lots of fine folks and companies can benefit from its contents -- not +profit from it. + + +CONTACT INFORMATION + + I would enjoy hearing from readers of this document, even if +it is just to say "hello" or whatever. I can be contacted as follows: + + Ken Lunde + Adobe Systems Incorporated + 1585 Charleston Road + P.O. Box 7900 + Mountain View, CA 94039-7900 USA + 415-962-3866 (office phone) + 415-960-0886 (facsimile) + lunde@adobe.com (preferred) + lunde@ora.com or ujip@ora.com + WWW Home Page: http://jasper.ora.com/lunde/ + +If you wonder what I do for my day job, read on. + I have been working for Adobe Systems for over four years now +(before that I was a graduate student at UW-Madison), and my current +position is Project Manager, CJK Type Development. + + +WHAT HAPPENED TO JAPAN.INF? + + Put bluntly, JAPAN.INF died. It first evolved into my first +book entitled "Understanding Japanese Information Processing" (this +book is now into its second printing, and the Japanese translation was +just published). After my book came out, I did attempt to update +JAPAN.INF, but the effort felt a bit futile. I decided that something +fresh was necessary. + JAPAN.INF also evolved into this document, which breaks the +Japanese barrier by providing similar information on Chinese and +Korean character sets and encodings. It fills the Chinese and Korean +gap, so to speak. 
My specialty (and hobby, believe it or not) is the +field of CJK character sets and encoding systems, so I felt that +shifting this document more towards those lines was appropriate use of +my (copious) free time (I wish there were more than 24 hours in a +day!). Besides, this document now becomes useful to a much broader +audience. + + +DISCLAIMER + + Ah yes, the ever popular disclaimer! Here's mine. Although I +list my address here at Adobe Systems Incorporated for contact +purposes, Adobe Systems does not endorse this document which I have +created, and have continued (and will continue) to update on a regular +basis (uh, yeah, I promise this time!). This document is a personal +endeavor to inform people of how CJK text can be handled on a variety +of platforms. + + +CONVENTIONS + + The notation that is used for detailing Internet resource +information, such as the Internet protocol type, site name, path, and +file follows the URL (Uniform Resource Locator) notation, namely: + + protocol://site-name/path/file + +An example URL is as follows: + + ftp://ftp.ora.com/pub/examples/nutshell/ujip/00README + +The protocol is FTP, the site-name is ftp.ora.com, the path is pub/ +examples/nutshell/ujip/, and the file is 00README. Also note that this +same notation is used for invoking FTP on WWW (World Wide Web) +browsing software, such as Mosaic, Netscape, or Lynx. + Note that most references to HTTP documents use the four- +letter file extension ".html". However, some HTTP documents are on +file systems that support only three-letter file extensions (can you +say "MS-DOS"?), so you may encounter just ".htm". This is just to let +you know that what you see is not a typo. + References to my book "Understanding Japanese Information +Processing" are (affectionately) abbreviated as UJIP. These references +also apply to the Japanese translation (UJIP-J). + Hexadecimal values are prefixed with 0x, and every two +hexadecimal digits represent a one-byte value. 
Other values can be +assumed to be in decimal notation. + Chinese characters are referred to as kanji (Japanese), hanzi +(Chinese), or hanja (Korean), depending on context. + References to ISO 10646-1:1993 also refer to Unicode +(usually). I have done this so that I do not have to repeat "Unicode" +in the same context as ISO 10646-1:1993. There are times, however, +when I need to distinguish ISO 10646-1:1993 from Unicode. + + +INTRODUCTION + + Electronic mail (e-mail), just one of the many Internet +resources, has become a very efficient means of communicating both +locally and world-wide. While it is very simple to send text which +uses only the 94 printable ASCII characters, character sets that +contain more than these ASCII characters pose special problems. + This document is primarily concerned with CJK character set +and encoding issues. Much of this sort of information is not easily +obtained. This represents one person's attempt at making such +information more widely available. + + +PART 1: WHAT'S UP WITH UJIP? + + UJIP (First Edition) was published in September 1993 by +O'Reilly & Associates, Incorporated. The second printing (*not* the +Second Edition) was subsequently published in March 1994. The page +count for both printings is unchanged at 470. + The following files contain the latest information about +changes (additions and corrections) made to UJIP and UJIP-J for +various printings, both for those that have taken place (such as for +the second printing of the English edition) and for those that are +planned (the first digit is the edition, and the second is the +printing): + + ftp://ftp.ora.com/pub/examples/nutshell/ujip/errata/ujip-errata-1-2.txt + ftp://ftp.ora.com/pub/examples/nutshell/ujip/errata/ujip-errata-1-3.txt + ftp://ftp.ora.com/pub/examples/nutshell/ujip/errata/ujip-j-errata-1-2.txt + +I *highly* recommend that all readers of UJIP obtain these errata +files. Those without FTP access can request copies directly from me. 
+     The Japanese translation of UJIP (UJIP-J), co-published by
+O'Reilly & Associates, Incorporated and SOFTBANK Corporation, was just
+released. The translation was done by my good friend Jack Halpern,
+along with one of his colleagues, Takeo Suzuki. The Japanese edition
+incorporates corrections and updates not yet found in the English
+edition. The page count is 535.
+     Late-breaking news! I am currently working on UJIP Second
+Edition (to be retitled as "Understanding CJK Information Processing"
+and abbreviated UCJKIP). If all goes well, it should be available by
+January 1997, and will be well over 700 pages. If there was something
+you wanted to see in UJIP, now's your chance to send me a request...
+
+
+PART 2: CJK CHARACTER SET STANDARDS
+
+     These sections describe the character sets used in Japan,
+China (PRC and Taiwan), and Korea. Exact numbers of characters are
+provided for each character set standard (when known), as well as
+tidbits of information not otherwise available. This provides the
+basic foundations for understanding how CJK scripts are handled on
+computer systems.
+     The two basic types of characters enumerated by CJK character
+set standards are Chinese characters (kanji, hanzi, or hanja), which
+number in the thousands (and, in some cases, tens of thousands), and
+characters other than Chinese characters (symbols, numerals, kana,
+hangul, alphabets, and so on), which usually number in the hundreds
+(there are thousands of pre-combined hangul, though).
+     If you happen to be running X Windows, it is very easy to
+display these CJK character sets (if a bitmapped font for the
+character set exists, that is). Here is what I usually do:
+
+o Obtain a BDF (Bitmap Distribution Format) font for the target
+  character set.
Try the following URLs for starters: + + ftp://cair-archive.kaist.ac.kr/pub/hangul/fonts/ + ftp://etlport.etl.go.jp/pub/mule/fonts/ + ftp://ftp.ifcss.org/pub/software/fonts/{big5,cns,gb,misc,unicode}/bdf/ + ftp://ftp.kuis.kyoto-u.ac.jp/misc/fonts/jisksp-fonts/ + ftp://ftp.net.tsinghua.edu.cn/pub/Chinese/fonts/ + ftp://ftp.ora.com/pub/examples/nutshell/ujip/unix/ + ftp://ftp.technet.sg:/pub/chinese/fonts/ + http://ccic.ifcss.org/www/pub/software/fonts/ + + BDF files usually have the string "bdf" somewhere in their file + name, usually at the end. If the file is compressed (noticing that + it ends in .gz or .Z is a good indication), decompress it. BDF files + are text files. + +o Convert the BDF file to SNF (Server Natural Format) or PCF (Portable + Compiled Format) using the programs "bdftosnf" or "bdftopcf," + respectively. Example command lines are as follows: + + % bdftopcf jiskan16-1990.bdf > k16-90.pcf + % bdftosnf jiskan16-1990.bdf > k16-90.snf + + SNF files (and the "bdftosnf" program) are used on X11R4 and + earlier, and PCF files (and the "bdftopcf" program) are used on + X11R5 and later. + +o Copy the SNF or PCF file to a directory in the font search path (or + make a new path). Supposing I made a new directory called "fonts" in + my home directory, I then run "mkfontdir" on the directory + containing the SNF or PCF files as follows: + + % mkfontdir ~/fonts + + This creates a fonts.dir file in ~/fonts. I can now add this + directory to my font search path with the following command: + + % xset +fp ~/fonts + +o The command "xfd" (X Font Displayer) with the "-fn" switch followed + by a font name then invokes a window that displays all the + characters of the font. In the case of two-byte (CJK) fonts, one row + is displayed at a time. 
The following is an example command line: + + % xfd -fn -misc-fixed-medium-r-normal--16-150-75-75-c-160-jisx0208.1990-0 + + You can create a "fonts.alias" file in the same directory as the + "fonts.dir" file in order to shorten the name when accessing the + font. The alias "k16-90" could be used instead if the content of the + fonts.alias file is as follows: + + k16-90 -misc-fixed-medium-r-normal--16-150-75-75-c-160-jisx0208.1990-0 + + Don't forget to execute the following command in order to make the + X Font Server aware of the new alias: + + % xset fp rehash + + Now you can use a simpler command line for "xfd" as follows: + + % xfd -fn k16-90 + + The "X Window System User's Guide" (Volume 3 of the X Window +System series by O'Reilly & Associates, Inc.) provides detailed +information on managing fonts under X Windows (pp 123-160). The +article entitled "The X Administrator: Font Formats and Utilities" (pp +14-34 in "The X Resource," Issue 2), describes the BDF, SNF, and PCF +formats in great detail. + There is another bitmap format called HBF (Hanzi Bitmap +Format), which is similar to BDF, but optimized for fixed-width +(monospaced) fonts. It is described in the article entitled "The HBF +Font Format: Optimizing Fixed-pitch Font Support" (pp 113-123 in "The +X Resource," Issue 10), and also at the following URL: + + ftp://ftp.ifcss.org/pub/software/fonts/hbf-discussion/ + +HBF fonts can be found at the following URL: + + ftp://ftp.ifcss.org/pub/software/fonts/{big5,cns,gb,misc,unicode}/hbf/ + + Lastly, you may wish to check out my newly-developed CJK +Character Set Server, which generates various CJK character sets with +proper encoding applied. It is written in Perl, and accessed through +an HTML form. This server can be considered an upgrade to my JChar +tool (written in C). 
The URL is: + + http://jasper.ora.com/lunde/cjk-char.html + + +2.1: JAPANESE + + All (national) character set standards that originate in Japan +have names that begin with the three letters JIS. JIS is short for +"Japanese Industrial Standard." But it is JSA (Japanese Standards +Association) who publishes the corresponding manuals. Chapter 3 and +Appendixes H and J of UJIP provide more detailed information on +Japanese character set standards. + + +2.1.1: JIS X 0201-1976 + + JIS X 0201-1976 (formerly JIS C 6220-1969; reaffirmed in 1989; +and its revision [with no character set changes] is currently under +public review) enumerates two sets of characters: JIS-Roman and +half-width katakana. + JIS-Roman is the Japanese equivalent of the ASCII character +set, namely 128 characters consisting of the following: + +o 10 numerals +o 52 uppercase and lowercase characters of the Latin alphabet +o 32 symbols (punctuation and so on) +o 34 non-printing characters (white space and control characters) + +The term "white space" refers to characters that occupy space, but +have no appearance, such as tabs, spaces, and termination characters +(line feed, carriage return, and form feed). + So, how are JIS-Roman and ASCII different? The following +three codes are (usually) different: + + Code ASCII JIS-Roman + ^^^^ ^^^^^ ^^^^^^^^^ + 0x5C backslash yen symbol + 0x7C broken bar bar + 0x7E tilde overbar + + Half-width katakana consists of 63 characters that provide a +minimal set of characters necessary for expressing Japanese. The +shapes are compressed, and visually occupy a space half that of +*normal* Japanese characters. + + +2.1.2: JIS X 0208-1990 + + This basic Japanese character set standard enumerates 6,879 +characters, 6,355 of which are kanji separated into two levels. Kanji +in the first level are arranged by (most frequent) reading, and those +in the second level are arranged by radical then total number of +(remaining) strokes. 
+ +o Row 1: 94 symbols +o Row 2: 53 symbols +o Row 3: 10 numerals and 52 uppercase and lowercase Latin alphabet +o Row 4: 83 hiragana +o Row 5: 86 katakana +o Row 6: 48 uppercase and lowercase Greek alphabet +o Row 7: 66 uppercase and lowercase Cyrillic (Russian) alphabet +o Row 8: 32 line-drawing elements +o Rows 16 through 47: 2,965 kanji (JIS Level 1 Kanji; last is 47-51) +o Rows 48 through 84: 3,390 kanji (JIS Level 2 Kanji; last is 84-06) + +Appendix B of UJIP provides a complete illustration of the JIS X +0208-1990 character set standard by KUTEN (row-cell) code. Appendix G +(pp 294-317) of "Developing International Software for Windows 95 and +Windows NT" by Nadine Kano illustrates the JIS X 0208-1990 character +set standard plus the Microsoft extensions by Shift-JIS code +(Microsoft calls this Code Page 932). + Earlier versions of this standard were dated 1978 (JIS C +6226-1978) and 1983 (JIS X 0208-1983, formerly JIS C 6226-1983). + JIS X 0208 went through a revision (from November 1995 until +February 1996), and is slated for publication sometime in 1996 (to +become JIS X 0208-1996). More information on this revision is +available at the following URL: + + ftp://ftp.tiu.ac.jp/jis/jisx0208/ + + +2.1.3: JIS X 0212-1990 + + This supplemental Japanese character set standard enumerates +6,067 characters, 5,801 of which are kanji ordered by radical then +total number of (remaining) strokes. All 5,801 kanji are unique when +compared to those in JIS X 0208-1990 (see Section 2.1.2). The +remaining 266 characters are categorized as non-kanji. + +o Row 2: 21 diacritics and symbols +o Row 6: 21 Greek characters with diacritics +o Row 7: 26 Eastern European characters +o Rows 9 through 11: 198 alphabetic characters +o Rows 16 through 77: 5,801 kanji (last is 77-67) + +Appendix C of UJIP provides a complete illustration of the JIS X +0212-1990 character set standard by KUTEN (row-cell) code. 
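The row listings above can be cross-checked with a little arithmetic. The following is a minimal Python sketch, not part of CJK.INF itself: it assumes that every row holds 94 cells and that each kanji block is filled contiguously from cell 1 of its first row up to the "last is" position quoted above. The function names are hypothetical. The KUTEN-to-JIS conversion (add 0x20 to both row and cell) is not stated in this section; it is included as a commonly documented rule.

```python
ROW_SIZE = 94  # assumption: each row of a 94-by-94 set holds 94 cells


def count_cells(first_row, last_row, last_cell=ROW_SIZE):
    """Count cells from cell 1 of first_row through last_cell of last_row."""
    return (last_row - first_row) * ROW_SIZE + last_cell


def kuten_to_jis(row, cell):
    """Convert a KUTEN (row-cell) pair to a two-byte JIS code.

    Assumption (not stated in this section): add 0x20 to row and cell.
    """
    return ((row + 0x20) << 8) | (cell + 0x20)


# Non-kanji counts per row, copied from the listings above.
JIS_X_0208_NON_KANJI = [94, 53, 10 + 52, 83, 86, 48, 66, 32]  # Rows 1-8
JIS_X_0212_NON_KANJI = [21, 21, 26, 198]  # Rows 2, 6, 7, and 9-11

level1 = count_cells(16, 47, last_cell=51)    # JIS Level 1 Kanji
level2 = count_cells(48, 84, last_cell=6)     # JIS Level 2 Kanji
x0212_kanji = count_cells(16, 77, last_cell=67)

print(level1, level2, level1 + level2 + sum(JIS_X_0208_NON_KANJI))
print(x0212_kanji, x0212_kanji + sum(JIS_X_0212_NON_KANJI))
print(hex(kuten_to_jis(47, 51)))
```

The totals agree with the figures quoted in Sections 2.1.2 and 2.1.3, which is a useful sanity check when building or validating a mapping table by hand.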
+ The only commercial operating system that provides JIS X +0212-1990 support is BTRON by Personal Media Corporation: + + http://www.personal-media.co.jp/ + +Section 3.3.18 provides information about TRON Code (used by BTRON), +and details how it encodes the JIS X 0212-1990 character set. + + +2.1.4: JIS X 0221-1995 + + This document is, for all practical purposes, the Japanese +translation of ISO 10646-1:1993 (see Section 2.5.1). Like ISO +10646-1:1993, it is based on Unicode Version 1.1. + It is noteworthy that JIS X 0221-1995 enumerates subsets that +are applicable for Japanese use (a brief description of their contents +in parentheses): + +o BASIC JAPANESE (JIS X 0208-1990 and JIS X 0201-1976 -- characters + that can be created by means of combining are not included -- 6,884 + characters) +o JAPANESE NON IDEOGRAPHICS SUPPLEMENT (1,913 characters: all non- + kanji of JIS X 0212-1990 plus hundreds of non-JIS characters) +o JAPANESE IDEOGRAPHICS SUPPLEMENT 1 (918 frequently-used kanji from + JIS X 0212-1990, including 28 that are identical to kanji forms in + JIS C 6226-1978) +o JAPANESE IDEOGRAPHICS SUPPLEMENT 2 (the remainder of JIS X 0212- + 1990, namely 4,883 kanji) +o JAPANESE IDEOGRAPHICS SUPPLEMENT 3 (the remaining kanji of ISO + 10646-1:1993, namely 8,746 characters) +o FULLWIDTH ALPHANUMERICS (94 characters; for compatibility) +o HALFWIDTH KATAKANA (63 characters; for compatibility) + + Pages 893 through 993 provide Kangxi Zidian (a classic +300-year-old Chinese character dictionary containing approximately +50,000 characters) and Dai Kanwa Jiten (also known as Morohashi) +indexes for the entire Chinese character block, namely from 0x4E00 +through 0x9FA5. + At 25,750 Yen, it is actually cheaper than ISO 10646-1:1993! + + +2.1.5: JIS X 0213-199X + + I recently became aware that JSA plans to publish an extension +to JIS X 0208, containing approximately 2,000 characters (kanji and +non-kanji). A public review of this new standard is planned for Summer +1996. 
I would expect that its information will eventually be available
+at the following URL:
+
+  ftp://ftp.tiu.ac.jp/jis/
+
+
+2.1.6: OBSOLETE STANDARDS
+
+     JIS C 6226-1978 and JIS X 0208-1983 (formerly JIS C 6226-1983)
+have been superseded by JIS X 0208-1990. Section 4.1 provides details
+on the changes made between these earlier versions of JIS X 0208.
+     JIS X 0221-1995 does not mean the end of JIS X 0201-1976, JIS
+X 0208-1990, and JIS X 0212-1990. Instead, it will co-exist with those
+standards.
+
+
+2.2: CHINESE (PRC)
+
+     All character set standards that originate in the PRC have
+designations that begin with "GB." "GB" is short for "Guo Biao" (which
+is, in turn, short for "Guojia Biaojun") and means "National
+Standard." A select few also have "/T" attached. The "T" is short for
+"Tuijian" ("recommended"), not "Traditional," although the GB/T
+standards listed below do happen to be the traditional versions.
+Section 2.2.11 describes ISO-IR-165:1992, which is a variant of GB
+2312-80. It is included here because of this relationship.
+     Most people correlate GB character set standards with
+simplified Chinese, but as you will see below, that is not always the
+case.
+     There are three basic character sets, each one having a
+simplified and traditional version.
+
+  Character Set    Set Number    Character Forms
+  ^^^^^^^^^^^^^    ^^^^^^^^^^    ^^^^^^^^^^^^^^^
+  GB 2312-80       0             Simplified
+  GB/T 12345-90    1             Traditional of GB 2312-80
+  GB 7589-87       2             Simplified
+  GB/T 13131-9X    3             Traditional of GB 7589-87
+  GB 7590-87       4             Simplified
+  GB/T 13132-9X    5             Traditional of GB 7590-87
+
+
+2.2.1: GB 1988-89
+
+     This character set, formerly GB 1988-80 and sometimes referred
+to as GB-Roman, is the Chinese analog to ASCII and ISO 646. The main
+difference is that the currency symbol (0x24), which is represented as
+a dollar sign ($) in ASCII, is represented as a Chinese Yuan
+(currency) symbol instead.
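As a rough illustration of the GB-Roman difference described above, the Python sketch below decodes 7-bit GB 1988-89 text by treating it as ASCII with a single override at 0x24. Two things here are assumptions rather than statements from this document: the choice of U+00A5 as the Unicode stand-in for the Yuan symbol, and the hypothetical name decode_gb_roman. Only the one difference named above is modeled, so this should not be read as a complete conversion table.

```python
# Assumption: U+00A5 stands in for the Yuan (currency) symbol at 0x24.
GB_ROMAN_OVERRIDES = {0x24: "\u00a5"}


def decode_gb_roman(data: bytes) -> str:
    """Decode 7-bit GB-Roman bytes, differing from ASCII only at 0x24."""
    chars = []
    for byte in data:
        if byte > 0x7F:
            raise ValueError(f"not a 7-bit code: 0x{byte:02X}")
        # Fall back to the ASCII interpretation for every other code.
        chars.append(GB_ROMAN_OVERRIDES.get(byte, chr(byte)))
    return "".join(chars)


print(decode_gb_roman(b"$100"))
```

The practical consequence is that a byte-identical file shows a dollar sign on an ASCII terminal and a Yuan symbol on a GB-Roman one, which matters when prices are stored as plain text.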
+
+
+2.2.2: GB 2312-80
+
+     This basic (simplified) Chinese character set standard
+enumerates 7,445 characters, 6,763 of which are hanzi separated into
+two levels. Hanzi in the first level are arranged by reading, and
+those in the second level are arranged by radical then total number of
+(remaining) strokes. GB 2312-80 is also known as the "Primary Set,"
+GB0 (zero), or just GB.
+
+o Row 1: 94 symbols
+o Row 2: 72 numerals
+o Row 3: 94 full-width GB 1988-89 characters (see Section 2.2.1)
+o Row 4: 83 hiragana
+o Row 5: 86 katakana
+o Row 6: 48 uppercase and lowercase Greek alphabet
+o Row 7: 66 uppercase and lowercase Cyrillic (Russian) alphabet
+o Row 8: 26 Pinyin and 37 Bopomofo characters
+o Row 9: 76 line-drawing elements (09-04 through 09-79)
+o Rows 16 through 55: 3,755 hanzi (Level 1 Hanzi; last is 55-89)
+o Rows 56 through 87: 3,008 hanzi (Level 2 Hanzi; last is 87-94)
+
+Compare some of the structure with JIS X 0208-1990, and you will find
+many similarities, such as:
+
+o Hiragana, katakana, Greek, and Cyrillic characters are in Rows 4, 5,
+  6, and 7, respectively
+o Chinese characters begin at Row 16
+o Chinese characters are separated into two levels
+o Level 1 arranged by reading
+o Level 2 arranged by radical then total number of strokes
+
+The Japanese standard, JIS C 6226-1978, came out in 1978, which means
+that it pre-dates GB 2312-80. The above similarities are not a
+coincidence, but rather by design.
+     Appendix G (pp 318-344) of "Developing International Software
+for Windows 95 and Windows NT" by Nadine Kano illustrates the GB
+2312-80 character set standard by EUC code (Microsoft calls this Code
+Page 936). Code Page 936 incorporates the correction of the hanzi at
+79-81, and the correction of the order of 07-22 and 07-23 (see Section
+2.2.3 for more details).
+
+
+2.2.3: GB 6345.1-86
+
+     This document specifies corrections and additions to GB
+2312-80 (see Section 2.2.2).
The following is a detailed enumeration
+of the changes:
+
+o The form of "g" in Row 3 (position 71) was altered
+o Row 8 has six additional Pinyin characters (08-27 through 08-32)
+o Row 10 contains half-width versions of Row 3 (94 characters)
+o Row 11 contains half-width versions of the Pinyin characters from
+  Row 8 (32 characters; 11-01 through 11-32)
+o The hanzi at 79-81 was corrected to have a simplified left-side
+  radical (this was an error in GB 2312-80)
+
+Note that these changes increase the total number of characters in GB
+2312-80 by 132 (6 plus 94 plus 32), which brings the total to 7,577
+(7,445 plus 132).
+     There was, however, an undocumented correction made in GB
+6345.1-86: it swaps characters 07-22 and 07-23 (uppercase Cyrillic),
+which appear in the wrong order in GB 2312-80. This ordering error is
+apparently in the first and perhaps second printing of the GB 2312-80
+manual, because the copy I have is from the third printing, and this
+has been corrected. Page 145 (Figure 113) of John Clews' "Language
+Automation Worldwide: The Development of Character Set Standards"
+illustrates this error. Developers should take special note of this --
+I have seen GB 2312-80 based font products that propagate this
+ordering error.
+
+
+2.2.4: GB 7589-87
+
+     This character set enumerates 7,237 hanzi in Rows 16 through
+92 (last is 92-93), and they are ordered by radical then total number
+of (remaining) strokes. GB 7589-87 is also known as the "Second
+Supplementary Set" or GB2.
+
+
+2.2.5: GB 7590-87
+
+     This character set enumerates 7,039 hanzi in Rows 16 through
+90 (last is 90-83), and they are ordered by radical then total number
+of (remaining) strokes. GB 7590-87 is also known as the "Fourth
+Supplementary Set" or GB4.
+
+
+2.2.6: GB 8565.2-88
+
+     This standard makes additions to GB 2312-80 (these additions
+are separate from those made in GB 6345.1-86 described in Section
+2.2.3). GB 8565.2-88 is also known as GB8.
In this case there are 705 +additions, indicated as follows: + +o Row 13 contains 50 hanzi from GB 7589-87 (last is 13-50) +o Row 14 contains 92 hanzi from GB 7590-87 (last is 14-92) +o Row 15 contains 69 non-hanzi indicating dates and times, plus 24 + miscellaneous hanzi (for personal/place names and radicals; last is + 15-93). +o Rows 90 through 94 contain 470 hanzi from GB 7589-87 (94 each) + +GB 8565.2-88 therefore provides a total of 8,150 characters (7,445 +plus 705). + + +2.2.7: GB/T 12345-90 + + This character set is nearly identical to GB 2312-80 (see +Section 2.2.2) in terms of the number and arrangement of characters, +but simplified hanzi are replaced by their traditional versions. GB/T +12345-90 is also known as the "Supplementary Set" or GB1. + The following are some interesting facts about this character +set (some instances of simplified/traditional pairs that appear below +are actually character form differences): + +o 29 vertical-use characters (punctuation and parentheses) included in + Row 6 (06-57 through 06-85). + +o 2,118 traditional hanzi replace simplified hanzi in Rows 16 through + 87. The "G1-Unique" appendix of the unofficial version (supplied to + the CJK-JRG for Han Unification purposes) is missing the following + four (specifies only 2,114): + + 0x5B3B 0x6D2F + 0x5E7C 0x6F71 + + But, ISO 10646-1:1993 ended up getting these hanzi included anyway, + with correct mappings. + +o Four simplified/traditional hanzi pairs (eight affected code points) + in rows 16 through 87 are swapped: + + 0x3A73 <-> 0x6161 + 0x5577 <-> 0x6167 + 0x5360 <-> 0x6245 (see the next bullet) + 0x4334 <-> 0x7761 + +o One hanzi (0x6245), after being swapped, had its left-side radical + unsimplified (this character, now at 0x5360, is considered part of + the 2,118 traditional hanzi from the second bullet): + + 0x6245 -> 0x5360 + +o 103 hanzi included in Rows 88 (94 characters) and 89 (9 characters; + 89-01 through 89-09). 
These are all related to characters between + Rows 16 and 87. + + - 41 simplified hanzi from Rows 16 through 87 moved to Rows 88 and + 89 (traditional hanzi are now at the original code points): + + 0x3327 -> 0x7827 0x3E5D -> 0x7846 0x4B49 -> 0x7869 + 0x3365 -> 0x7828 0x3F64 -> 0x7849 0x4C28 -> 0x786B + 0x3373 -> 0x7829 0x402F -> 0x784B 0x4D3F -> 0x786F + 0x3533 -> 0x782C 0x4030 -> 0x784C 0x4D72 -> 0x7871 + 0x356D -> 0x782D 0x406F -> 0x784E 0x5236 -> 0x7878 + 0x3637 -> 0x782F 0x4131 -> 0x7850 0x5374 -> 0x7879 + 0x3736 -> 0x7832 0x463B -> 0x785C 0x5438 -> 0x787C + 0x3761 -> 0x7833 0x463E -> 0x785D 0x5446 -> 0x787D + 0x3849 -> 0x7835 0x464B -> 0x785E 0x5622 -> 0x7921 + 0x3963 -> 0x7838 0x464D -> 0x785F 0x563B -> 0x7923 + 0x3B2E -> 0x783B 0x4653 -> 0x7860 0x5656 -> 0x7926 + 0x3C38 -> 0x7840 0x4837 -> 0x7866 0x567E -> 0x7928 + 0x3C5B -> 0x7842 0x4961 -> 0x7867 0x573C -> 0x7929 + 0x3C76 -> 0x7843 0x4A75 -> 0x7868 + + - 62 hanzi added to Rows 88 and 89 (the gaps from the above are + filled in). These were mostly to account for multiple traditional + hanzi collapsing into a single simplified form. 
+ + - The following code point mappings illustrate how all of these 103 + hanzi are related to hanzi between Rows 16 and 87 (note how many + of these 103 hanzi map to a single code point): + + 0x7821 -> 0x305A 0x7844 -> 0x3D2A 0x7867 -> 0x4961 + 0x7822 -> 0x3065 0x7845 -> 0x3E21 0x7868 -> 0x4A75 + 0x7823 -> 0x316D 0x7846 -> 0x3E5D 0x7869 -> 0x4B49 + 0x7824 -> 0x3170 0x7847 -> 0x3E6D 0x786A -> 0x4B55 + 0x7825 -> 0x3237 0x7848 -> 0x3F4B 0x786B -> 0x4C28 + 0x7826 -> 0x3245 0x7849 -> 0x3F64 0x786C -> 0x4C28 + 0x7827 -> 0x3327 0x784A -> 0x4027 0x786D -> 0x4C28 + 0x7828 -> 0x3365 0x784B -> 0x402F 0x786E -> 0x4C33 + 0x7829 -> 0x3373 0x784C -> 0x4030 0x786F -> 0x4D3F + 0x782A -> 0x3376 0x784D -> 0x405B 0x7870 -> 0x4D45 + 0x782B -> 0x3531 0x784E -> 0x406F 0x7871 -> 0x4D72 + 0x782C -> 0x3533 0x784F -> 0x407A 0x7872 -> 0x4F35 + 0x782D -> 0x356D 0x7850 -> 0x4131 0x7873 -> 0x4F35 + 0x782E -> 0x362C 0x7851 -> 0x414B 0x7874 -> 0x4F4C + 0x782F -> 0x3637 0x7852 -> 0x4231 0x7875 -> 0x4F72 + 0x7830 -> 0x3671 0x7853 -> 0x425E 0x7876 -> 0x506B + 0x7831 -> 0x3722 0x7854 -> 0x4339 0x7877 -> 0x5229 + 0x7832 -> 0x3736 0x7855 -> 0x4349 0x7878 -> 0x5236 + 0x7833 -> 0x3761 0x7856 -> 0x4349 0x7879 -> 0x5374 + 0x7834 -> 0x3834 0x7857 -> 0x4349 0x787A -> 0x5379 + 0x7835 -> 0x3849 0x7858 -> 0x4356 0x787B -> 0x5375 + 0x7836 -> 0x3948 0x7859 -> 0x4366 0x787C -> 0x5438 + 0x7837 -> 0x394E 0x785A -> 0x436F 0x787D -> 0x5446 + 0x7838 -> 0x3963 0x785B -> 0x3159 0x787E -> 0x5460 + 0x7839 -> 0x6358 0x785C -> 0x463B 0x7921 -> 0x5622 + 0x783A -> 0x3A7A 0x785D -> 0x463E 0x7922 -> 0x563B + 0x783B -> 0x3B2E 0x785E -> 0x464B 0x7923 -> 0x563B + 0x783C -> 0x3B58 0x785F -> 0x464D 0x7924 -> 0x5642 + 0x783D -> 0x3B63 0x7860 -> 0x4653 0x7925 -> 0x5646 + 0x783E -> 0x3B71 0x7861 -> 0x4727 0x7926 -> 0x5656 + 0x783F -> 0x3C22 0x7862 -> 0x4729 0x7927 -> 0x566C + 0x7840 -> 0x3C38 0x7863 -> 0x4F4B 0x7928 -> 0x567E + 0x7841 -> 0x3C52 0x7864 -> 0x476F 0x7929 -> 0x573C + 0x7842 -> 0x3C5B 0x7865 -> 0x477A + 0x7843 -> 0x3C76 
0x7866 -> 0x4837 + +So, if we total everything up, we see that GB/T 12345-90 has 2,180 +hanzi (2,118 are replacements for GB 2312-80 code points, and 62 are +additional) and 29 non-hanzi not found in GB 2312-80. + Note that the printing of the GB/T 12345-90 has some +character-form errors. The errors I am aware of are as follows: + + Code Point Description of Error + ^^^^^^^^^^ ^^^^^^^^^^^^^^^^^^^^ + 0x4125 The upper-left element should be "tree" instead of + "warrior" + 0x596C The "bird" radical should not include the "fire" element + + +2.2.8: GB/T 13131-9X + + This character set is identical to GB 7589-87 (see Section +2.2.4) in terms of number of characters, but simplified hanzi are +replaced by their traditional versions. The exact number of such +substitutions is currently unknown to this author. GB/T 13131-9X is +also known as the "Third Supplementary Set" or GB3. + + +2.2.9: GB/T 13132-9X + + This character set is identical to GB 7590-87 (see Section +2.2.5) in terms of number of characters, but simplified hanzi are +replaced by their traditional versions. The exact number of such +substitutions is currently unknown to this author. GB/T 13132-9X is +also known as the "Fifth Supplementary Set" or GB5. + + +2.2.10: GB 13000.1-93 + + This document is, for all practical purposes, the Chinese +translation of ISO 10646-1:1993 (see Section 2.5.1). 
+ + +2.2.11: ISO-IR-165:1992 + + This standard, also known as the CCITT Chinese Set, is a +variant of GB 2312-80 with the following characteristics: + +o GB 6345.1-86 modifications (including the undocumented one) and + additions, namely 132 characters (see Section 2.2.3) +o GB 8565.2-88 additions, namely 705 characters (see Section 2.2.6) +o Row 6 contains 22 background (shading) characters (06-60 through + 06-81) +o Row 12 contains 94 hanzi +o Row 13 contains 44 additional hanzi (13-51 through 13-94; fills the + row) +o Row 15 contains 1 additional hanzi (15-94) + +ISO-IR-165:1992 can therefore be considered a superset of GB 2312-80, +GB 6345.1-86, and GB 8565.2-88. This means 8,443 total characters +compared to the 7,445 in GB 2312-80, 7,577 in GB 6345.1-86, and the +8,150 in GB 8565.2-88. + + +2.2.12: OBSOLETE STANDARDS + + Most GB standards seem to be revised through other documents, +so it is hard to point to a standard and claim that it is obsolete. +The only revision I am aware of is the GB 1988-89 (the original was +named GB 1988-80). + + +2.3: CHINESE (TAIWAN) + + The sections below describe two major Taiwanese character +sets, namely Big Five and CNS 11643-1992. As you will learn they are +somewhat compatible. CCCII, also developed in Taiwan, is described in +Section 2.5.2. + + +2.3.1: BIG FIVE + + The Big Five character set is composed of 94 rows of 157 +characters each (the 157 characters of each row are encoded in an +initial group of 63 codes followed by the remaining 94 codes). The +following is a break-down of its contents: + +o Row 1: 157 symbols +o Row 2: 157 symbols +o Row 3: 94 symbols +o Rows 4 through 38: 5,401 hanzi (Level 1 Hanzi; last is 38-63) +o Rows 41 through 89: 7,652 hanzi (Level 2 Hanzi; last is 89-116) + +This forms what I consider to be the basic Big Five set. Actually, two +of the hanzi in Level 2 are duplicates, so there are actually only +7,650 unique hanzi in Level 2. + There are two major extensions to Big Five. 
The first really +has no name, and can be considered part of the basic Big Five set as +specified above. It adds the following characters: + +o Rows 38-39: 4 Japanese iteration marks, 83 hiragana, 86 katakana, 66 + uppercase and lowercase Cyrillic (Russian) alphabet, 10 circled + digits, and 10 parenthesized digits + + The other extension was developed by a company called ETen +Information System in Taiwan, and is actually considered to be the +most widely used version of Big Five. It provides the following +extensions to Big Five (different from the above extension): + +o Rows 38-40: 10 circled digits, 10 parenthesized digits, 10 lowercase + Roman numerals, 25 classical radicals, 15 Japanese-specific symbols, + 83 hiragana, 86 katakana, 66 uppercase and lowercase Cyrillic + (Russian) alphabet, 3 arrows, 10 radical-like hanzi elements, 40 + fraction-like digits, and 7 symbols +o Row 89: 7 hanzi, 33 double-lined line-drawing elements, and a black + box + + It is *very* important to note that while these two extensions +have many common portions (in particular, hiragana, katakana, the +Cyrillic alphabet, and so on), they do not share the same code points +for such characters. + Appendix G (pp 407-450) of "Developing International Software +for Windows 95 and Windows NT" by Nadine Kano illustrates the Big Five +character set standard by Big Five code (Microsoft calls this Code +Page 950). Code Page 950 incorporates some of the ETen extensions, +namely those in Row 89. + + +2.3.2: CNS 11643-1992 + + CNS 11643-1992 (also known as CNS 11643 X 5012), by +definition, consists of 16 planes of characters, seven of which have +character assignments. Each plane is a 94-row-by-94-cell matrix +capable of holding a total of 8,836 characters. CNS stands for +"Chinese National Standard." + CNS 11643-1992 specifies characters only in the first seven +planes. 
A break-down of characters, by plane, is as follows: + +o Plane 1: + - 438 symbols in Rows 1 through 6 + - 213 classical radicals in Rows 7 through 9 + - 33 graphic representations of control characters in Row 34 + - 5,401 hanzi in Rows 36 through 93 (last is 93-43) +o Plane 2: 7,650 hanzi in Rows 1 through 82 (last is 82-36) +o Plane 3: 6,148 hanzi in Rows 1 through 66 (last is 66-38) +o Plane 4: 7,298 hanzi in Rows 1 through 78 (last is 78-60) +o Plane 5: 8,603 hanzi in Rows 1 through 92 (last is 92-49) +o Plane 6: 6,388 hanzi in Rows 1 through 68 (last is 68-90) +o Plane 7: 6,539 hanzi in Rows 1 through 70 (last is 70-53) + +The total number of characters in CNS 11643-1992 is a staggering +48,711, of which 48,027 are hanzi. Also note that the number of hanzi +in Plane 1 is identical to the number of Level 1 hanzi in Big Five (see +Section 2.3.1). The 2 extra hanzi in Level 2 of Big Five are +actually redundant, and are therefore not in CNS 11643-1992 Plane 2. + It is rumored that Plane 8 is currently being defined, and +will add yet more hanzi to this standard. + + +2.3.3: CNS 5205 + + This character set is Taiwan's analog to ASCII and ISO 646, +and is reportedly rarely used. How it differs from ASCII, if at all, +is unknown to this author. + + +2.3.4: OBSOLETE STANDARDS + + CNS 11643-1986 specified characters only in the first three +planes, as described in Section 2.3.2. Also, Plane 3 of CNS 11643-1992 +was called Plane 14 of CNS 11643-1986. + + +2.4: KOREAN + + The sections below describe the most current Korean character +sets, namely KS C 5636-1993, KS C 5601-1992, KS C 5657-1991, and KS C +5700-1995. "KS" stands for "Korean Standard." + + +2.4.1: KS C 5636-1993 + + This character set (published on January 6, 1993), formerly KS +C 5636-1989 (published on April 22, 1989) and sometimes referred to as +KS-Roman, is the Korean analog to ASCII and ISO 646-1991. The primary +difference is that the ASCII backslash (0x5C) is represented as a Won +symbol. 
+ + +2.4.2: KS C 5601-1992 + + This basic Korean character set standard enumerates 8,224 +characters, 4,888 of which are hanja, and 2,350 of which are pre- +combined hangul. The hanja and hangul blocks are arranged by reading. +The following is a break-down of its contents: + +o Row 1: 94 symbols +o Row 2: 69 abbreviations and symbols +o Row 3: 94 full-width KS C 5636-1993 characters (see Section 2.4.1) +o Row 4: 94 hangul elements +o Row 5: 68 lowercase and uppercase Roman numerals and lowercase and + uppercase Greek alphabet +o Row 6: 68 line-drawing elements +o Row 7: 79 abbreviations +o Row 8: 91 phonetic symbols, circled characters, and fractions +o Row 9: 94 phonetic symbols, parenthesized characters, subscripts, + and superscripts +o Row 10: 83 hiragana +o Row 11: 86 katakana +o Row 12: 66 lowercase and uppercase Cyrillic (Russian) alphabet +o Rows 16 through 40: 2,350 pre-combined hangul (last is 40-94) +o Rows 42 through 93: 4,888 hanja (last is 93-94) + +Rows 41 and 94 are designated for user-defined characters. + There are many similarities with JIS X 0208-1990 and GB +2312-80, such as hiragana, katakana, Greek, and Cyrillic characters, +but they are assigned to different rows. + There is an interesting note about the hanja block (Rows 42 +through 93). Although there are 4,888 hanja, not all are unique. The +hanja block is arranged by reading, and in those cases when a hanja +has more than one reading, that hanja is duplicated (sometimes more +than once) in the same character set. There are 268 such cases of +duplicate hanja in KS C 5601-1992, meaning that it contains 4,620 +unique hanja. If you have a copy of the KS C 5601-1992 manual handy, +you can compare the following four code points: + + 0x6445 + 0x5162 + 0x5525 + 0x6879 + +While most of these cases involve two hanja instances, there are four +hanja that have three instances, and one (listed above) that has four! 
+This is the only CJK character set that has this property of +intentionally duplicating Chinese characters. See Section 4.4 for more +details. + Annex 3 of this standard defines the complete set of 11,172 +pre-combined hangul characters, also known as Johab. Johab refers to +the encoding method, and is almost like encoding all possible three- +letter words (meaning that most are nonsense). See Section 3.3.5 for +more details on Johab encoding. + + +2.4.3: KS C 5657-1991 + + This character set standard provides supplemental characters +for Korean writing, including symbols, pre-combined hangul, and +hanja. The following is a break-down of its contents: + +o Rows 1 through 7: 613 lowercase and uppercase Latin characters with + diacritics (see note below) +o Rows 8 through 10: 273 lowercase and uppercase Greek characters with + diacritics +o Rows 11 through 13: 275 symbols +o Row 14: 27 compound hangul elements +o Rows 16 through 36: 1,930 pre-combined hangul (last is 36-50) +o Rows 37 through 54: 1,675 pre-combined hangul (last is 54-77; see + note below) +o Rows 55 through 85: 2,856 hanja (last is 85-36) + +The KS C 5657-1991 manual has a possible error (or at least an +inconsistency) for Rows 1 through 7. The manual says that there are +615 characters in that range, but I only counted 613. The difference +can be found on page 19 as the following two characters: + + Character Code Character + ^^^^^^^^^^^^^^ ^^^^^^^^^ + 0x2137 X + 0x217A TM + +An "X" doesn't belong there (it is already in KS C 5601-1992 at code +point 0x2358), and the trademark symbol is also part of KS C 5601-1992 +at code point 0x2262. This is why I feel that my count of 613 is more +accurate than what is explicitly stated in the manual on page 2. + Also, page 2 of the manual says that Rows 37 through 54 +contain 1,677 pre-combined hangul, but I only counted 1,675 (17 rows +of 94 characters plus a final row with 77 characters, that is, +17 x 94 + 77 = 1,675). + Here's another interesting note. 
My official copy of this +standard has all of its 2,856 hanja hand-written. + + +2.4.4: GB 12052-89 + + You may be asking yourself why a GB standard is listed under +the Korean section of this document. Well, there is a rather large +Korean population in China (Korea was considered part of China before +the 1890s), and they need a character set standard for communicating +using hangul. GB 12052-89 is a Korean character set standard +established by China (PRC), and enumerates a total of 5,979 +characters. + The following is the arrangement of this character set: + +o Row 1: 94 symbols +o Row 2: 72 numerals +o Row 3: 94 full-width ASCII characters +o Row 4: 83 hiragana +o Row 5: 86 katakana +o Row 6: 48 uppercase and lowercase Greek alphabet +o Row 7: 66 uppercase and lowercase Cyrillic (Russian) alphabet +o Row 8: 26 Pinyin and 37 Bopomofo characters +o Row 9: 76 line-drawing elements (09-04 through 09-79) +o Rows 16 through 37: 2,068 pre-combined hangul (Level 1 Hangul, Part + 1; last is 37-94) +o Rows 38 through 52: 1,356 pre-combined hangul (Level 1 Hangul, Part + 2; last is 52-40) +o Rows 53 through 71: 1,779 pre-combined hangul (Level 2 Hangul; last + is 71-87) +o Rows 71 through 72: 94 "Idu" hanja (71-89 through 72-88) + + There are a few interesting notes I can make about this +character set: + +o Rows 1 through 9 are identical to the same rows in GB 2312-80, + except that 03-04 is a dollar sign, not a Chinese Yuan (currency) + symbol. + +o The GB 12052-89 manual states on pp 1 and 3 that Rows 53 through 72 + contain 1,876 characters, but I only counted 1,873 (1,779 hangul + plus 94 hanja). + +o The total number of characters, 5,979, is correctly stated in the + manual although the hangul count is incorrect. + +o The arrangement and ordering of these hangul bear no relationship to + that of KS C 5601-1992. Both standards order by reading, which is + the only way in which they are similar. 
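The counts quoted for GB 12052-89 can be re-derived from the row-cell ranges alone. The following is a minimal sketch in Python, assuming that every block listed above starts at cell 1 of its first row (unless a starting cell is given) and is filled without gaps; the function and variable names are illustrative only and do not come from the standard.

```python
# Re-derive the GB 12052-89 character counts from the row-cell ranges
# quoted in this section. Assumption: each block is contiguous.
CELLS_PER_ROW = 94


def span(first, last):
    """Count code points from first to last, inclusive; each is (row, cell)."""
    (row1, cell1), (row2, cell2) = first, last
    return (row2 - row1) * CELLS_PER_ROW + (cell2 - cell1) + 1


hangul_blocks = {
    "Level 1 Hangul, Part 1": span((16, 1), (37, 94)),
    "Level 1 Hangul, Part 2": span((38, 1), (52, 40)),
    "Level 2 Hangul": span((53, 1), (71, 87)),
}
idu_hanja = span((71, 89), (72, 88))

# Rows 1 through 9, in the order listed above (Row 8 is 26 Pinyin + 37 Bopomofo).
non_hangul_rows = [94, 72, 94, 83, 86, 48, 66, 26 + 37, 76]

hangul_total = sum(hangul_blocks.values())
rows_53_to_72 = hangul_blocks["Level 2 Hangul"] + idu_hanja
grand_total = sum(non_hangul_rows) + hangul_total + idu_hanja

print(hangul_blocks)
print("Rows 53 through 72:", rows_53_to_72)
print("Total:", grand_total)
```

Under those assumptions the per-block figures agree with the list above, Rows 53 through 72 come to 1,873 rather than the manual's 1,876, and the grand total still matches the manual's 5,979.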
+ + I do not know to what extent this character set is being +used (and who might be using it). + + +2.4.5: KS C 5700-1995 + + Korea has developed a new character set standard called KS C +5700-1995. It is equivalent to ISO 10646-1:1993, but has pre-combined +hangul as provided (and ordered) in Unicode Version 2.0 (meaning that +all 11,172 hangul are in a contiguous block). + + +2.4.6: OBSOLETE STANDARDS + + KS C 5601-1986, KS C 5601-1987, and KS C 5601-1989 are +identical, character-set wise, to KS C 5601-1992. The 1992 edition provides +more material in the form of annexes. KS C 5601-1982, the original +version, enumerated only the 51 basic hangul elements in a one-byte 7- +and 8-bit encoding. This information is still part of KS C 5601-1992, +but in Annex 4. + There were two earlier multiple-byte standards called KS C +5619-1982 and KIPS. KS C 5619-1982 enumerated 51 hangul elements, +1,316 pre-combined hangul, and 1,672 hanja. KIPS (Korean Information +Processing System) enumerated 2,058 pre-combined hangul and 2,392 +hanja. Both have been rendered obsolete by KS C 5601-1987. + + +2.5: CJK + + The only true CJK character sets available today are CCCII, +ANSI Z39.64-1989 (also known as EACC or REACC), and ISO 10646-1:1993. +ISO 10646-1:1993 is unique in that it goes beyond CJK (Chinese +characters) to provide virtually all commonly-used alphabetic scripts. + Of these three, only ISO 10646-1:1993 is expected to gain +wide-spread acceptance. CCCII and ANSI Z39.64-1989 are still used +today, but primarily for bibliographic purposes. + + +2.5.1: ISO 10646-1:1993 + + Published by ISO (International Organization for +Standardization) in Switzerland, this character set enumerates over +34,000 characters. Its I-zone ("I" stands for "Ideograph") enumerates +approximately 21,000 Chinese characters, which is the result of a +massive effort by the CJK-JRG (CJK Joint Research Group) called "Han +Unification."
The CJK-JRG is now called the IRG (Ideographic +Rapporteur Group), and is off doing additional research for future +Chinese character allocations to ISO 10646-1:1993. + The Basic Multilingual Plane (BMP) of ISO 10646-1:1993 is +equivalent to Unicode. While Unicode consists of a single plane of +characters (which doesn't allow much room for future expansion), ISO +10646-1:1993 contains hundreds of such planes. + One very nice feature of this standard's manual is the set of CJK +code correspondence tables in Section 26 (pp 262-698). Four columns +are provided for each ISO 10646-1:1993 I-zone code point -- simplified +Chinese, traditional Chinese, Japanese, and Korean. If the ISO +10646-1:1993 Chinese character maps to one of these locales, the +hexadecimal character code, (decimal) row-cell value, and glyph for +that locale are provided. The corresponding tables in Volume 2 of "The +Unicode Standard" provide character codes (sometimes the hexadecimal +character code, and sometimes the row-cell value) and a single +glyph. Quite unfortunate. I hear that a new edition of "The Unicode +Standard" is about to be released. I hope that this problem has been +addressed. + ISO 10646-1:1993 does not replace existing national character +set standards. It simply provides a single character set that is a +superset of *most* national character sets. For example, only a +fraction of the 48,027 hanzi in CNS 11643-1992 are included in ISO +10646-1:1993. I feel that it is best to think of ISO 10646-1:1993 as +"just another character set." My philosophy is to support as many +character sets and encodings as possible. + A note about ordering this standard. If you order through ANSI +in the United States, try to get an original manual. It is not easy, +though. You see, ANSI has duplication rights for ISO documents. +Photocopying Section 26 (pp 262-698) doesn't do the Chinese characters +much justice, and some characters become hard to read. 
Unfortunately, +there is no way to indicate that you want an original ISO document +through ANSI's ordering process, so some post-ordering haggling may +become necessary. + More information on ISO 10646-1:1993 can be found at the +following URL: + + http://www.unicode.org/ + + Japan, China (PRC), and Korea have developed their own +national standards that are based on ISO 10646-1:1993. They are +designated as JIS X 0221-1995 (see Section 2.1.4), GB 13000.1-93 (see +Section 2.2.10), and KS C 5700-1995 (see Section 2.4.5), respectively. + Note that these national standards are aligned with different +versions of Unicode, of which there are three: + + Unicode Version 1.0 + Unicode Version 1.1 <-> ISO 10646-1:1993, JIS X 0221-1995, GB 13000.1-93 + Unicode Version 2.0 <-> KS C 5700-1995 + +One of the major changes made for Unicode Version 2.0 is the inclusion +of all 11,172 hangul. Version 1.1 has 6,656 hangul. + + +2.5.2: CCCII + + The Chinese Character Analysis Group in Taiwan developed CCCII +(Chinese Character Code for Information Interchange) in the 1980s. +This character set is composed of 94 planes that have 94 rows and 94 +cells (94 x 94 x 94 = 830,584 characters). Furthermore, every six +planes constitute a "layer" (6 x 94 x 94 = 53,016 characters). The +following are the contents of each of the 16 layers (the 16th layer +contains only four planes): + +o Layer 1: Symbols and Traditional Chinese characters +o Layer 2: Simplified Chinese characters from PRC +o Layers 3 through 12: Variant Chinese character forms +o Layer 13: Japanese kana and kokuji (Japanese-made kanji) +o Layer 14: Korean hangul +o Layer 15: Reserved +o Layer 16: Miscellaneous characters (Japanese and Korean) + + Layers 1 through 12 have a special meaning and relationship. +The same code point in these layers is designed to hold the same +character, but with different forms. 
Layer 1 code points contain the +traditional character forms, Layer 2 code points contain the +simplified character forms (if any), and Layers 3 through 12 contain +variant character forms (if any). For example, given a Chinese +character with three forms, its encoding and arrangement may be as +follows: + + Character Form Code Point Layer + ^^^^^^^^^^^^^^ ^^^^^^^^^^ ^^^^^ + Traditional 0x224E41 1 + Simplified 0x284E41 2 + Variant 0x2E4E41 3 + +Note how the second and third bytes (0x4E41) are identical in all +three instances -- only the first byte's value, which indicates the +layer, differs. Needless to say, this method of arrangement provides +easy access to related Chinese character forms. No wonder it is used +for bibliographic purposes. + The first layer is composed as follows: + +o Plane 1/Row 2: 56 mathematical symbols +o Plane 1/Row 3: The ASCII character set +o Plane 1/Row 11: 35 Chinese punctuation marks +o Plane 1/Rows 12 through 14: 214 classical radicals +o Plane 1/Row 15: 41 Chinese numerical symbols, 37 phonetic symbols, + and 4 tone marks +o Plane 1/Rows 16 through 67: 4,808 common Chinese characters +o Plane 1/Row 68 through Plane 3/Row 64: 17,032 less common Chinese + characters +o Plane 3/Row 65 through Plane 6/Row 5: 20,583 rare Chinese characters + +Note that Row 1 of all planes is reserved, and never assigned +characters. Take this into account when studying the above table +ranges that span planes (that is, skip Row 1). + In addition to the above, there are 11,517 simplified Chinese +characters in Layer 2 (3,625 are considered PRC simplified forms, and +the remaining 7,892 are regular simplified forms). This provides a +total of 53,940 Chinese characters. 
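The layer arrangement described in this section can be expressed as simple arithmetic on the first byte. The following is a minimal sketch in Python. It assumes that the first byte falls in the range 0x21 through 0x7E (one value per plane, 94 in all) and that each run of six consecutive first-byte values forms one layer, which is consistent with the three example code points above. The function names are illustrative and are not part of CCCII.

```python
# Sketch of CCCII layer arithmetic.
# Assumptions: first byte 0x21-0x7E selects the plane; six planes per layer.
PLANES_PER_LAYER = 6


def cccii_layer(code):
    """Return the layer (1 through 16) of a three-byte CCCII code."""
    first_byte = (code >> 16) & 0xFF
    if not 0x21 <= first_byte <= 0x7E:
        raise ValueError("first byte out of range: 0x%02X" % first_byte)
    return (first_byte - 0x21) // PLANES_PER_LAYER + 1


def cccii_in_layer(code, layer):
    """Return the code at the same position as `code`, but in another layer."""
    if not 1 <= layer <= 16:
        raise ValueError("layer must be 1 through 16")
    first_byte = (code >> 16) & 0xFF
    offset = (first_byte - 0x21) % PLANES_PER_LAYER  # plane within its layer
    new_first = 0x21 + (layer - 1) * PLANES_PER_LAYER + offset
    if new_first > 0x7E:
        raise ValueError("layer 16 contains only four planes")
    return (new_first << 16) | (code & 0xFFFF)


traditional = 0x224E41
print(cccii_layer(traditional))                   # layer of the traditional form
print(hex(cccii_in_layer(traditional, 2)))        # simplified form, if any
print(hex(cccii_in_layer(traditional, 3)))        # first variant form, if any
```

Whether a given position in Layers 2 through 12 is actually assigned can only be determined from the CCCII tables themselves; the sketch computes where a related form would be, not whether it exists.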
+ Further information on CCCII (to include very interesting +historical notes) can be found on pp 146-149 of John Clews' "Language +Automation Worldwide: The Development of Character Set Standards" and +Chapter 6 of Huang & Huang's "An Introduction to Chinese, Japanese, +and Korean Computing." + + +2.5.3: ANSI Z39.64-1989 + + This national standard is designated as ANSI Z39.64-1989 and +named "East Asian Character Code" (EACC), but was originally known as +REACC (RLIN East Asian Character Code), that is, before it became a +national standard. RLIN stands for "Research Libraries Information +Network," which was developed by the Research Libraries Group (RLG) +located in Mountain View, California. + RLG's Home Page is at the following URL: + + http://www.rlg.org/ + + The structure of ANSI Z39.64-1989 is based on CCCII, but with +a few differences. Many consider it to be superior to and a +replacement for CCCII (see Section 2.5.2). + The ANSI Z39.64-1989 standard is available through ANSI, but +you should be aware that it is distributed in the form of several +microfiche. Not a terribly useful storage medium these days. I had my +set transformed into tangible printed pages. You can also obtain this +standard through NISO (National Information Standards Organization) +Press Fulfillment. Their URL is: + + http://www.niso.org/ + + EACC has been designated by the Library of Congress as a +character set for use in USMARC (United States MAchine-Readable +Cataloging) records, and is used extensively by East Asian libraries +across North America. + EACC is also being used in Australia for the National CJK +Project. 
Check out the following URL for more details: + + http://www.nla.gov.au/1/asian/ncjk/cjkhome.html + + Further information on ANSI Z39.64-1989 (to include very +interesting historical notes) can be found on pp 150-156 of John +Clews' "Language Automation Worldwide: The Development of Character +Set Standards" (although a source at RLG tells me that some of Clews' +facts are wrong) and Chapter 6 of Huang & Huang's "An Introduction to +Chinese, Japanese, and Korean Computing." + The authoritative paper on EACC is "RLIN East Asian Character +Code and the RLIN CJK Thesaurus" by Karen Smith Yoshimura and Alan +Tucker, published in "Proceedings of the Second Asian-Pacific +Conference on Library Science," May 20-24, 1985, Seoul, Korea. + + +2.6: OTHER + + This section includes character set standards that don't +properly fall under the above sections. + + +2.6.1: GB 8045-87 + + GB 8045-87 is a Mongolian character set standard established +by China (PRC). This standard enumerates 94 Mongolian characters. Of +these 94 characters, 12 are punctuation (vertically-oriented), and the +remaining 82 are characters specific to the Mongolian script. +Mongolian is written vertically like Chinese. + I do not discuss the encoding for GB 8045-87 in Part 3, so I +will do so here. The GB 8045-87 manual describes a 7- and 8-bit +encoding. The 7-bit encoding puts these 94 characters in the standard +ASCII printable range, namely 0x21 through 0x7E. Code point 0x20 is +marked as "MSP" which stands for "Mongolian space." The 8-bit encoding +puts these 94 characters in the range 0xA1 through 0xFE, with the +"MSP" character at code point 0xA0. The GB 1988-89 set is then encoded +in the range 0x21 through 0x7E. + + +2.6.2: TCVN-5773:1993 + + TCVN-5773:1993 (also called NSCII, which is short for Nom +Standard Code for Information Interchange) is the Vietnamese analog to +ISO 10646-1:1993, but adds 1,775 Vietnamese-specific Chinese +characters. 
These 1,775 characters are encoded in the range 0xA000 +through 0xA6EE. + More information on TCVN-5773:1993 can be found at the +following URL: + + ftp://unicode.org/pub/MappingTables/EastAsiaMaps/ + +There are two files at the above URL that pertain to this standard. +The first is a README, and the second is a Macintosh HyperCard stack +(requires HyperCard): + + TCVN-NSCII.README + TCVN-NSCIIstack_1.0.sea.hqx + + +PART 3: CJK ENCODING SYSTEMS + + These sections describe the various systems for encoding the +character set standards listed in Part 2. The first two described, +7-bit ISO 2022 and EUC, are not specific to a locale, and in some +cases not specific to CJK. + The CJK Character Set Server at the following URL can generate +character sets based on encodings described in this section: + + http://jasper.ora.com/lunde/cjk-char.html + +I suggest that you use this as a way to obtain files that illustrate +these encodings in action. + But first, please take a peek at the following table, which is +an attempt to illustrate how two Chinese characters (that stand for +"kanji/hanzi/hanja") are encoded using the various methods presented +in the following sections (character codes as hexadecimal digits, and +escape sequences or shift sequences as printable characters): + +o Japanese (JIS X 0208-1990 & JIS X 0201-1976): + - 7-bit ISO 2022 <ESC> & @ <ESC> $ B 0x3441 0x3B7A <ESC> ( J + - ISO-2022-JP <ESC> $ B 0x3441 0x3B7A <ESC> ( J + - EUC 0xB4C1 0xBBFA + - Shift-JIS 0x8ABF 0x8E9A + +o Simplified Chinese (GB 2312-80 & GB 1988-89 or ASCII): + - 7-bit ISO 2022 <ESC> $ A 0x3A3A 0x5756 <ESC> ( T + - ISO-2022-CN <ESC> $ ) A <SO> 0x3A3A 0x5756 <SI> + - EUC 0xBABA 0xD7D6 + - HZ (HZ-GB-2312) ~{ 0x3A3A 0x5756 ~} + - zW zW 0x3A3A 0x5756 + +o Traditional Chinese (CNS 11643-1992): + - 7-bit ISO 2022 <ESC> $ ( G 0x6947 0x4773 <ESC> ( B + - ISO-2022-CN <ESC> $ ) G <SO> 0x6947 0x4773 <SI> + - EUC 0xE9C7 0xC7F3 or 0x8EA1E9C7 0x8EA1C7F3 + +o Traditional Chinese (Big Five): + - Big Five 
0xBA7E 0xA672 + +o Korean (KS C 5601-1992 & ASCII): + - 7-bit ISO 2022 <ESC> $ ( C 0x7953 0x6D2E <ESC> ( B + - ISO-2022-KR <ESC> $ ) C <SO> 0x7953 0x6D2E <SI> + - EUC 0xF9D3 0xEDAE + - Johab 0xF7D3 0xF1AE + +o CJK (ISO 10646-1:1993, JIS X 0221-1995, GB 13000.1-93, or KS C + 5700-1995): + - UCS-2 0x6F22 0x5B57 + - UCS-4 0x00006F22 0x00005B57 + +The above should have given you a taste of what information the +following sections provide. + + +3.1: 7-BIT ISO 2022 ENCODING + + 7-bit ISO 2022 is the name commonly given to the encoding +system that uses escape sequences to shift between character sets. +(ISO 2022 encoded Japanese text is also known as "JIS" encoding, but +is different from ISO-2022-JP and ISO-2022-JP-2, and will be explained +in Section 3.1.3.) This encoding comes from the ISO 2022-1993 +standard. + An escape sequence, as the name implies, consists of an escape +character followed by a sequence of one or more characters. These +escape sequences are used to change the character set of the text +stream. This may also mean a shift from one- to two-byte-per-character +mode (or vice versa). + 7-bit ISO 2022 character sets fall into two types: one-byte +and two-byte. CJK character sets, for obvious reasons, fall into the +latter group. + One advantage that 7-bit ISO 2022 encoding has over other +encoding systems is that its escape sequences specify the character +set, and thus the locale. 7-bit ISO 2022 encoding also encodes +text using only seven-bit bytes, which has the benefit of being able +to survive Internet travel (e-mail). + + +3.1.1: CODE SPACE + + Each byte in the representation of graphic (printable) +characters falls into the range 0x21 (decimal 33) through 0x7E (decimal +126). For one-byte character sets, this means a maximum of 94 +characters. For two-byte character sets, this means a maximum of 8,836 +characters (94 x 94 = 8,836). 
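The relationship between this code space and the row-cell values used throughout Part 2 can be sketched in a few lines of Python. The sketch assumes the usual convention that row and cell numbers run from 1 through 94 and that each byte is the row or cell number plus 0x20; the function names are illustrative only.

```python
# Convert between row-cell values (1-94 each) and two-byte 7-bit ISO 2022
# codes, whose bytes each lie in the range 0x21 through 0x7E.
FIRST, LAST = 0x21, 0x7E


def rowcell_to_code(row, cell):
    """Return the two-byte code for a row-cell pair."""
    if not (1 <= row <= 94 and 1 <= cell <= 94):
        raise ValueError("row and cell must each be 1 through 94")
    return bytes([0x20 + row, 0x20 + cell])


def code_to_rowcell(code):
    """Return (row, cell) for a two-byte code."""
    if len(code) != 2 or not all(FIRST <= b <= LAST for b in code):
        raise ValueError("expected two bytes in the range 0x21-0x7E")
    return code[0] - 0x20, code[1] - 0x20


# Size of the one-byte and two-byte code spaces.
one_byte_max = LAST - FIRST + 1
two_byte_max = one_byte_max * one_byte_max

print(one_byte_max, two_byte_max)
print(code_to_rowcell(bytes([0x34, 0x41])))   # first character of the overview table
print(rowcell_to_code(40, 94).hex())          # last hangul of KS C 5601-1992
```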
+ + One-byte Characters Encoding Range + ^^^^^^^^^^^^^^^^^^^ ^^^^^^^^^^^^^^ + first byte range 0x21-0x7E + + Two-byte Characters Encoding Range + ^^^^^^^^^^^^^^^^^^^ ^^^^^^^^^^^^^^ + first byte range 0x21-0x7E + second byte range 0x21-0x7E + +White space and control characters (of which the "escape" character is +one) are still found in 0x00-0x20 and 0x7F. + + +3.1.2: ISO-REGISTERED ESCAPE SEQUENCES + + The following is a table that provides the ISO-registered +escape sequences for various one- and two-byte character sets +mentioned in Part 2 of this document (ISO registration numbers +provided in the fourth column): + + One-byte Character Set Escape Sequence Hexadecimal ISO Reg + ^^^^^^^^^^^^^^^^^^^^^^ ^^^^^^^^^^^^^^^ ^^^^^^^^^^^ ^^^^^^^ + ASCII (ANSI X3.4-1986) <ESC> ( B 0x1B2842 6 + Half-width katakana <ESC> ( I 0x1B2849 13 + JIS X 0201-1976 Roman <ESC> ( J 0x1B284A 14 + GB 1988-89 Roman <ESC> ( T 0x1B2854 57 + + Two-byte Character Set Escape Sequence Hexadecimal ISO Reg + ^^^^^^^^^^^^^^^^^^^^^^ ^^^^^^^^^^^^^^^ ^^^^^^^^^^^ ^^^^^^^ + JIS C 6226-1978 <ESC> $ @ 0x1B2440 42 + GB 2312-80 <ESC> $ A 0x1B2441 58 + JIS X 0208-1983 <ESC> $ B 0x1B2442 87 + KS C 5601-1992 <ESC> $ ( C 0x1B242843 149 + JIS X 0212-1990 <ESC> $ ( D 0x1B242844 159 + ISO-IR-165:1992 <ESC> $ ( E 0x1B242845 165 + JIS X 0208-1990 <ESC> & @ <ESC> $ B 0x1B26401B2442 168 + CNS 11643-1992 Plane 1 <ESC> $ ( G 0x1B242847 171 + CNS 11643-1992 Plane 2 <ESC> $ ( H 0x1B242848 172 + CNS 11643-1992 Plane 3 <ESC> $ ( I 0x1B242849 183 + CNS 11643-1992 Plane 4 <ESC> $ ( J 0x1B24284A 184 + CNS 11643-1992 Plane 5 <ESC> $ ( K 0x1B24284B 185 + CNS 11643-1992 Plane 6 <ESC> $ ( L 0x1B24284C 186 + CNS 11643-1992 Plane 7 <ESC> $ ( M 0x1B24284D 187 + +Note that the escape sequences for JIS C 6226-1978, GB 2312-80, and +JIS X 0208-1983 (and therefore JIS X 0208-1990, which reuses the +latter) do not use an opening parenthesis (0x28 or "("), which means +that they don't follow the 7-bit ISO 2022 rules precisely. 
They are shorter +for historical reasons, and are retained for backwards compatibility. +Also note that not all of the CJK character set standards described in +Part 2 have ISO-registered escape sequences. + There are other encoding methods that are similar to 7-bit ISO +2022 in that they are suitable for Internet use, but are locale- +specific. These include HZ and zW encoding, both of which are specific +to the GB 2312-80 character set (see Sections 3.3.2 and 3.3.3). +ISO-2022-JP, ISO-2022-KR, ISO-2022-CN, and ISO-2022-CN-EXT are +described below. + + +3.1.3: ISO-2022-JP AND ISO-2022-JP-2 + + ISO-2022-JP is best described as a subset of 7-bit ISO 2022 +encoding for Japanese, and reflects how Japanese text is encoded for +e-mail messages. ISO-2022-JP-2 is an extension that supports +additional character sets. + There are only four escape sequences permitted in ISO-2022-JP, +indicated as follows: + + One-byte Character Set Escape Sequence Hexadecimal + ^^^^^^^^^^^^^^^^^^^^^^ ^^^^^^^^^^^^^^^ ^^^^^^^^^^^ + ASCII (ANSI X3.4-1986) <ESC> ( B 0x1B2842 + JIS X 0201-1976 Roman <ESC> ( J 0x1B284A + + Two-byte Character Set Escape Sequence Hexadecimal + ^^^^^^^^^^^^^^^^^^^^^^ ^^^^^^^^^^^^^^^ ^^^^^^^^^^^ + JIS C 6226-1978 <ESC> $ @ 0x1B2440 + JIS X 0208-1983 <ESC> $ B 0x1B2442 + +Note the lack of JIS X 0208-1990, JIS X 0212-1990, and half-width +katakana escape sequences. The JIS X 0208-1983 escape sequence is used +to indicate both JIS X 0208-1983 and JIS X 0208-1990 (for practical +reasons). 
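To illustrate how the four ISO-2022-JP escape sequences drive decoding, the following is a minimal sketch in Python that splits a byte stream into runs of one-byte and two-byte characters. It is not a full decoder: it only tracks the current character set, assumes the stream starts in ASCII, and rejects any escape sequence outside the four listed above. The function name and the table layout are illustrative only.

```python
# Split an ISO-2022-JP byte stream into (character set, [character bytes]) runs.
# Only the four escape sequences listed above are recognized.
ISO_2022_JP_ESCAPES = {
    b"\x1b(B": ("ASCII", 1),
    b"\x1b(J": ("JIS X 0201-1976 Roman", 1),
    b"\x1b$@": ("JIS C 6226-1978", 2),
    b"\x1b$B": ("JIS X 0208-1983", 2),
}


def split_iso2022jp(data):
    charset, width = "ASCII", 1      # assumed initial state
    segments = []
    i = 0
    while i < len(data):
        if data[i] == 0x1B:          # escape character: switch character set
            seq = data[i:i + 3]
            if seq not in ISO_2022_JP_ESCAPES:
                raise ValueError("escape sequence not permitted: %r" % seq)
            charset, width = ISO_2022_JP_ESCAPES[seq]
            i += 3
            continue
        char = data[i:i + width]
        if len(char) < width:
            raise ValueError("truncated two-byte character")
        if segments and segments[-1][0] == charset:
            segments[-1][1].append(char)
        else:
            segments.append((charset, [char]))
        i += width
    return segments


# The ISO-2022-JP example from the overview table at the start of Part 3.
sample = b"\x1b$B\x34\x41\x3b\x7a\x1b(J"
for name, chars in split_iso2022jp(sample):
    print(name, [c.hex() for c in chars])
```

A real decoder would also have to map each two-byte code to a character and handle line breaks and control characters, which this sketch treats as ordinary bytes of the current character set.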
+ ISO-2022-JP-2 permits additional escape sequences, indicated +as follows: + + One-byte Character Set Escape Sequence Hexadecimal + ^^^^^^^^^^^^^^^^^^^^^^ ^^^^^^^^^^^^^^^ ^^^^^^^^^^^ + ASCII (ANSI X3.4-1986) <ESC> ( B 0x1B2842 + JIS X 0201-1976 Roman <ESC> ( J 0x1B284A + + Two-byte Character Set Escape Sequence Hexadecimal + ^^^^^^^^^^^^^^^^^^^^^^ ^^^^^^^^^^^^^^^ ^^^^^^^^^^^ + JIS C 6226-1978 <ESC> $ @ 0x1B2440 + JIS X 0208-1983 <ESC> $ B 0x1B2442 + JIS X 0212-1990 <ESC> $ ( D 0x1B242844 + GB 2312-80 <ESC> $ A 0x1B2441 + KS C 5601-1992 <ESC> $ ( C 0x1B242843 + +With the introduction of ISO-2022-KR (see Section 3.1.4), ISO-2022-CN +(see Section 3.1.5), and ISO-2022-CN-EXT (see Section 3.1.5), the +usefulness of supporting GB 2312-80 and KS C 5601-1992 can be +questioned. However, ISO-2022-JP-2 provides support for JIS X +0212-1990. + More detailed information on ISO-2022-JP encoding can be found +in RFC 1468. And, more detailed information on ISO-2022-JP-2 encoding +can be found in RFC 1554. + + +3.1.4: ISO-2022-KR + + ISO-2022-KR is similar to ISO-2022-JP (see Section 3.1.3) in +that it reflects how Korean text is encoded for e-mail messages. +However, its actual implementation is a bit different. Below is a +summary. + There are only two shift sequences used in ISO-2022-KR, +indicated as follows: + + One-byte Character Set Shift Sequence Hexadecimal + ^^^^^^^^^^^^^^^^^^^^^^ ^^^^^^^^^^^^^^ ^^^^^^^^^^^ + ASCII (ANSI X3.4-1986) <SI> 0x0F + + Two-byte Character Set Shift Sequence Hexadecimal + ^^^^^^^^^^^^^^^^^^^^^^ ^^^^^^^^^^^^^^ ^^^^^^^^^^^ + KS C 5601-1992 <SO> 0x0E + +Furthermore, the following designator sequence must appear only once, +at the beginning of a line, before any KS C 5601-1992 characters (this +usually means that it appears by itself on the first line of the +file): + + <ESC> $ ) C 0x1B242943 + +It almost looks the same as the KS C 5601-1992 escape sequence in +7-bit ISO 2022, but look again. 
The opening parenthesis (0x28 or "(") +is replaced by a closing parenthesis (0x29 or ")"). This designator +sequence serves a different purpose than an escape sequence. It is +like a flag indicating that "this document contains KS C 5601-1992 +characters." The <SO> and <SI> control characters actually perform the +switching between one- (ASCII) and two-byte (KS C 5601-1992) codes. + More detailed information on ISO-2022-KR encoding can be found +in RFC 1557. + + +3.1.5: ISO-2022-CN AND ISO-2022-CN-EXT + + ISO-2022-CN and ISO-2022-CN-EXT are similar to ISO-2022-JP +(see Section 3.1.3) and ISO-2022-KR (see Section 3.1.4) in that they +reflect how Chinese text is encoded for e-mail messages. + Like with ISO-2022-KR, there are only two shift sequences, +indicated as follows: + + One-byte Character Set Shift Sequence Hexadecimal + ^^^^^^^^^^^^^^^^^^^^^^ ^^^^^^^^^^^^^^ ^^^^^^^^^^^ + ASCII (ANSI X3.4-1986) <SI> 0x0F + + Two-byte Character Set Shift Sequence Hexadecimal + ^^^^^^^^^^^^^^^^^^^^^^ ^^^^^^^^^^^^^^ ^^^^^^^^^^^ + <Too Many to List> <SO> 0x0E + +But, unlike ISO-2022-KR, there are single shift sequences. Single +shift means that they are used before every (single) character, not +before sequences of characters. + + Single Shift Type Shift Sequence Hexadecimal + ^^^^^^^^^^^^^^^^^ ^^^^^^^^^^^^^^ ^^^^^^^^^^^ + SS2 <ESC> N 0x1B4E + SS3 <ESC> O (not zero!) 0x1B4F + + ISO-2022-CN supports the following character sets using SO and +SS2 designations: + + Character Set Type Designation Sequence Hexadecimal + ^^^^^^^^^^^^^ ^^^^ ^^^^^^^^^^^^^^^^^^^^ ^^^^^^^^^^^ + GB 2312-80 SO <ESC> $ ) A 0x1B242941 + CNS 11643-1992 Plane 1 SO <ESC> $ ) G 0x1B242947 + CNS 11643-1992 Plane 2 SS2 <ESC> $ * H 0x1B242A48 + +The designator sequences must appear once on a line before any +instance of the character set it designates. 
If two lines contain +characters from the same character set, both lines must include the +designator sequence (this is so the text can be displayed correctly +when scrolling back in a window). This is different behavior from +ISO-2022-KR where the designator sequence appears once in the entire +file (this is because ISO-2022-KR supports a single two-byte character +set). + ISO-2022-CN-EXT supports the following character sets using +SO, SS2, and SS3 designations (notice how ISO-2022-CN is still +supported in the same manner): + + Character Set Type Designation Sequence Hexadecimal + ^^^^^^^^^^^^^ ^^^^ ^^^^^^^^^^^^^^^^^^^^ ^^^^^^^^^^^ + GB 2312-80 SO <ESC> $ ) A 0x1B242941 + GB/T 12345-90 SO NOT REGISTERED + ISO-IR-165 SO <ESC> $ ) E 0x1B242945 + CNS 11643-1992 Plane 1 SO <ESC> $ ) G 0x1B242947 + CNS 11643-1992 Plane 2 SS2 <ESC> $ * H 0x1B242A48 + GB 7589-87 SS2 NOT REGISTERED + GB/T 13131-9X SS2 NOT REGISTERED + CNS 11643-1992 Plane 3 SS3 <ESC> $ + I 0x1B242B49 + CNS 11643-1992 Plane 4 SS3 <ESC> $ + J 0x1B242B4A + CNS 11643-1992 Plane 5 SS3 <ESC> $ + K 0x1B242B4B + CNS 11643-1992 Plane 6 SS3 <ESC> $ + L 0x1B242B4C + CNS 11643-1992 Plane 7 SS3 <ESC> $ + M 0x1B242B4D + GB 7590-87 SS3 NOT REGISTERED + GB/T 13132-9X SS3 NOT REGISTERED + +Support for character sets indicated as NOT REGISTERED will be added +once they are ISO-registered. + More detailed information on ISO-2022-CN and ISO-2022-CN-EXT +encodings can be found in RFC 1922. + + +3.2: EUC ENCODING + + EUC stands for "Extended UNIX Code," and is a rich encoding +system from ISO 2022-1993 that is designed to handle large or multiple +character sets. It is primarily used on UNIX systems, such as Sun's +Solaris. + EUC consists of four code sets, numbered 0 through 3. The +only code set that is more or less fixed by definition is code set 0, +which is specified to contain ASCII or a locale's equivalent (such as +JIS X 0201-1976 for Japanese or GB 1988-89 for PRC Chinese).
+ It is quite common to append the locale name to "EUC" when +designating a specific instance of EUC encoding. Common designations +include EUC-JP, EUC-CN, EUC-KR, and EUC-TW. + + +3.2.1: JAPANESE REPRESENTATION + + The following table illustrates the Japanese representation of +EUC packed format: + + EUC Code Sets Encoding Range + ^^^^^^^^^^^^^ ^^^^^^^^^^^^^^ + Code set 0 (ASCII or JIS X 0201-1976 Roman): 0x21-0x7E + Code set 1 (JIS X 0208): 0xA1A1-0xFEFE + Code set 2 (half-width katakana): 0x8EA1-0x8EDF + Code set 3 (JIS X 0212-1990): 0x8FA1A1-0x8FFEFE + +An earlier version of EUC for Japanese used code set 3 as the user- +defined range. + + +3.2.2: CHINESE (PRC) REPRESENTATION + + The following table illustrates the Chinese (PRC) +representation of EUC packed format: + + EUC Code Sets Encoding Range + ^^^^^^^^^^^^^ ^^^^^^^^^^^^^^ + Code set 0 (ASCII or GB 1988-89): 0x21-0x7E + Code set 1 (GB 2312-80): 0xA1A1-0xFEFE + Code set 2: unused + Code set 3: unused + +Note how code sets 2 and 3 are unused. + The encoding used on Macintosh is quite similar, but has a +shortened two-byte range (0xA1A1 through 0xFCFE) plus additional +one-byte code points, namely 0x80 ("u" with dieresis), 0xFD +("copyright" symbol: "c" in a circle), 0xFE ("trademark" symbol: "TM" +as a superscript), and 0xFF ("ellipsis" symbol: three dots). + + +3.2.3: CHINESE (TAIWAN) REPRESENTATION + + The following table illustrates the Chinese (Taiwan) +representation of EUC packed format: + + EUC Code Sets Encoding Range + ^^^^^^^^^^^^^ ^^^^^^^^^^^^^^ + Code set 0 (ASCII): 0x21-0x7E + Code set 1 (CNS 11643-1992 Plane 1): 0xA1A1-0xFEFE + Code set 2 (CNS 11643-1992 Planes 1-16): 0x8EA1A1A1-0x8EB0FEFE + Code set 3: unused + +Note how CNS 11643-1992 Plane 1 is redundantly encoded in code set 1 +(two-byte) and code set 2 (four-byte). The second byte of code set 2 +indicates the plane number. For example, 0xA1 is Plane 1 and so on up +until 0xB0, which is Plane 16. 
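The code set layout for the Chinese (Taiwan) representation can be sketched in a few lines of Python. This is only an illustration built from the ranges in the table above; the function names are hypothetical, and no validation beyond the lead bytes is attempted.

```python
# Sketch: pick the EUC code set from a lead byte (Taiwan representation)
# and recover the CNS 11643-1992 plane number from a four-byte sequence.
# Function names are hypothetical.

def euc_tw_code_set(lead: int) -> int:
    """Return the EUC code set (0-2) selected by a lead byte."""
    if 0x21 <= lead <= 0x7E:
        return 0            # one-byte ASCII
    if 0xA1 <= lead <= 0xFE:
        return 1            # two-byte CNS 11643-1992 Plane 1
    if lead == 0x8E:
        return 2            # four-byte sequence, prefixed by 0x8E
    raise ValueError(f"not an EUC-TW lead byte: {lead:#04x}")

def euc_tw_plane(seq: bytes) -> int:
    """Return the plane number of a four-byte code set 2 sequence."""
    if len(seq) != 4 or seq[0] != 0x8E or not 0xA1 <= seq[1] <= 0xB0:
        raise ValueError("not a code set 2 sequence")
    return seq[1] - 0xA0    # 0xA1 -> Plane 1 ... 0xB0 -> Plane 16

print(euc_tw_plane(bytes([0x8E, 0xA2, 0xA1, 0xA1])))
```

Because Plane 1 is reachable through both code set 1 and code set 2, a converter has to choose one form on output; the two-byte form is the obvious candidate since it is shorter.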
+ + +3.2.4: KOREAN REPRESENTATION + + The following table illustrates the Korean representation of +EUC packed format (this is also known as "Wansung" encoding -- the +Korean word "wansung" means "pre-compose"): + + EUC Code Sets Encoding Range + ^^^^^^^^^^^^^ ^^^^^^^^^^^^^^ + Code set 0 (ASCII or KS C 5636-1993): 0x21-0x7E + Code set 1 (KS C 5601-1992): 0xA1A1-0xFEFE + Code set 2: unused + Code set 3: unused + +Note how code sets 2 and 3 are unused. + The encoding used on Macintosh is quite similar, but has a +shortened two-byte range (0xA1A1 through 0xFDFE) plus additional +one-byte code points, namely 0x81 ("won" symbol), 0x82 (hyphen), 0x83 +("copyright" symbol: "c" in a circle), 0xFE ("trademark" symbol: "TM" +as a superscript), and 0xFF ("ellipsis" symbol: three dots). + See Section 3.3.17 for a description of Microsoft's extension +to this encoding, called Unified Hangul Code. + + +3.3: LOCALE-SPECIFIC ENCODINGS + + The encoding systems described in the following sections are +considered to be locale-specific, namely that they are used to encode a +specific character set standard. This is not to say that they are not +widely used (actually, some of these are among the most widely used +encoding systems!), but rather that they are tied to a specific +character set. + + +3.3.1: SHIFT-JIS + + Shift-JIS (also known as MS Kanji, SJIS, or DBCS-PC) is the +encoding system used on machines that support MS-DOS or Windows, and +also for Macintosh (KanjiTalk or Japanese Language Kit). It was +originally developed by Microsoft Corporation as a way to support the +Japanese character set on MS-DOS.
The following tables provide the +Shift-JIS encoding ranges: + + Two-byte Standard Characters Encoding Ranges + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ^^^^^^^^^^^^^^^ + first byte ranges 0x81-0x9F, 0xE0-0xEF + second byte ranges 0x40-0x7E, 0x80-0xFC + + Two-byte User-defined Characters Encoding Ranges + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ^^^^^^^^^^^^^^^ + first byte range 0xF0-0xFC + second byte ranges 0x40-0x7E, 0x80-0xFC + + One-byte Characters Encoding Range + ^^^^^^^^^^^^^^^^^^^ ^^^^^^^^^^^^^^ + Half-width katakana 0xA1-0xDF + ASCII/JIS-Roman 0x21-0x7E + +It is important to note that the user-defined range does not +correspond to code points in other encodings that support Japanese, +such as 7-bit ISO 2022 or EUC. This is a portability problem. It is +also unique in that it does not support the JIS X 0212-1990 character +set standard. + The encoding used on Macintosh is quite similar to the above +table, but has additional one-byte code points, namely 0x80 +(backslash), 0xFD ("copyright" symbol: "c" in a circle), 0xFE +("trademark" symbol: "TM" as a superscript), and 0xFF ("ellipsis" +symbol: three dots). + + +3.3.2: HZ (HZ-GB-2312) + + HZ is a simple yet very powerful and reliable system for +encoding GB 2312-80 text which was developed by Fung Fung Lee +(lee@umunhum.stanford.edu). HZ encoding is commonly used when +exchanging e-mail or posting messages to Usenet News (specifically, to +alt.chinese.text). + The actual encoding ranges used for one- and two-byte +characters are almost identical to 7-bit ISO 2022 encoding (see Section +3.1.1). The first-byte range is limited to 0x21 through 0x77. But, +instead of using an escape sequence to shift between one- and two-byte +character modes, a simple string of two printable characters is used.
+ + One-byte Character Set Shift Sequence Hexadecimal + ^^^^^^^^^^^^^^^^^^^^^^ ^^^^^^^^^^^^^^ ^^^^^^^^^^^ + ASCII ~} 0x7E7D + + Two-byte Character Set Shift Sequence Hexadecimal + ^^^^^^^^^^^^^^^^^^^^^^ ^^^^^^^^^^^^^^ ^^^^^^^^^^^ + GB 2312-80 ~{ 0x7E7B + +The tilde character (0x7E) is interpreted as an escape character in HZ +encoding, so it has special meaning. If a tilde character is to appear +in one-byte-per-character mode, it must be doubled (so ~~ would appear +as just ~). This means that there are three escape sequences used in +HZ encoding: + + Escape Sequence Meaning + ^^^^^^^^^^^^^^^ ^^^^^^^ + ~~ ~ in one-byte-per-character mode + ~} Shift into one-byte-per-character mode + ~{ Shift into two-byte-per-character mode + +There is also a fourth escape sequence, namely ~ plus a newline +character (~\n). This escape sequence is a line-continuation marker to +be consumed with no output produced. + This method works without problems because the shift sequences +represent empty positions in the very last row of the GB 2312-80 table +(actually, the second- and third-from-last code points). HZ encoding +makes 77 of the 94 rows accessible, and because there are no defined +characters beyond row 77, this causes no problems. + The complete HZ specification is part of the HZ package, +described in RFC 1843, and available in HTML format. These are +available at the following URLs: + + ftp://ftp.ifcss.org/pub/software/unix/convert/HZ-2.0.tar.gz + ftp://ftp.ora.com/pub/examples/nutshell/ujip/Ch9/rfc-1843.txt + http://umunhum.stanford.edu/~lee/chicomp/HZ_spec.html + +In addition, RFC 1842 establishes "HZ-GB-2312" as the "charset" +parameter in MIME-encoded e-mail headers. Its properties are identical +to HZ encoding as described in RFC 1843. + + +3.3.3: zW + + zW encoding, developed by Ya-Gui Wei and Edmund Lai, is older +than and somewhat similar to HZ encoding (HZ is considered to be a +better encoding system, and users are encouraged to switch over to HZ +encoding). 
+ zW encoding is named by how it encodes each line of GB 2312-80 +text, namely lines that contain Chinese text must begin with the two +characters "z" and "W" ("zW"). This encoding method does not permit +the mixture of one- (ASCII) and two-byte (GB 2312-80) characters on a +per-character basis, but rather on a per-line basis. That is, each +line can contain only Chinese or ASCII text, but not both. + More information on zW encoding can be found as part of the +ZWDOS package available at the following URL: + + ftp://ftp.ifcss.org/pub/software/dos/ZWDOS/ + + +3.3.4: BIG FIVE + + Big Five is the encoding system used on machines that support +MS-DOS or Windows, and also for Macintosh (such as the Chinese +Language Kit or the fully-localized operating system). + + Two-byte Standard Characters Encoding Ranges + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ^^^^^^^^^^^^^^^ + first byte range 0xA1-0xFE + second byte ranges 0x40-0x7E, 0xA1-0xFE + + One-byte Characters Encoding Range + ^^^^^^^^^^^^^^^^^^^ ^^^^^^^^^^^^^^ + ASCII 0x21-0x7E + + The encoding used on Macintosh is quite similar to the above, +but has a slightly shortened two-byte range (second byte range up to +0xFC only) plus additional one-byte code points, namely 0x80 +(backslash), 0xFD ("copyright" symbol: "c" in a circle), 0xFE +("trademark" symbol: "TM" as a superscript), and 0xFF ("ellipsis" +symbol: three dots). + + +3.3.5: JOHAB + + Korean hangul characters are typically encoded in what is +known as pre-combined form, namely 2 or 3 hangul elements bound into a +single character. KS C 5601-1992 enumerates 2,350 such pre-combined +forms. While this number is felt to be sufficient for most purposes, +it does not account for the total number of possible permutations. The +encoding system that encodes all possible pre-combined hangul is known +as Johab encoding (also known as "two-byte combination code" -- the +Korean word "johab" means "combine"), and is described in Annex 3 of +the KS C 5601-1992 standard. 
This encoding is almost like encoding all +possible three-letter words in English -- while all combinations are +possible, only a fraction represent *real* words. + Pre-combined hangul can be composed of 19 different initial, +21 different medial, and 27 different final hangul elements (28, +actually, if you count the placeholder). This provides a maximum of +11,172 pre-combined hangul. Of these 67 hangul elements, 51 are unique +(some can occur in different positions). Each of these positions is +encoded using five bits (five bits can encode up to 32 unique +objects). The encoding array looks as follows: + +o Bit 1: always on +o Bits 2-6: initial hangul element +o Bits 7-11: medial hangul element +o Bits 12-16: final hangul element + +Initial and final elements are consonants, and the medial elements are +vowels. This encoding must be treated as a 16-bit entity because the +bit array of the medial hangul element spans the first and second byte. + Johab encoding also provides the complete set of KS C 5601- +1992 symbols and hanja, but in different code points. Annex 3 of the +KS C 5601-1992 manual (pp 33-34) contains a complete symbol and hanja +mapping table between EUC and Johab code points. (The KS C 5601-1989 +manual did not have this.) The code space ranges for Johab encoding +are as follows: + + One-byte Characters Encoding Range + ^^^^^^^^^^^^^^^^^^^ ^^^^^^^^^^^^^^ + ASCII or KS C 5636-1993 0x21-0x7E + + Two-byte Pre-combined Hangul Encoding Ranges + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ^^^^^^^^^^^^^^^ + first byte range 0x84-0xD3 + second byte ranges 0x41-0x7E, 0x81-0xFE + + Two-byte Symbols and Hanja Encoding Ranges + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ^^^^^^^^^^^^^^^ + first byte ranges 0xD8-0xDE, 0xE0-0xF9 + second byte ranges 0x31-0x7E, 0x91-0xFE + +Note that the second byte ranges encode a total of 188 characters, and +that the second byte ranges for hangul and symbols/hanja are slightly +different (yet the same size, namely 188 characters).
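The Johab bit layout and the sizes quoted above can be checked with a short Python sketch. The helper name johab_pack is hypothetical, and the three 5-bit values are copied from the N-byte Hangul table in Section 3.3.6 (initial "g", medial "a", and the final placeholder).

```python
# Sketch: pack three 5-bit hangul element codes into one 16-bit Johab
# code. The helper name is hypothetical; the 5-bit values come from the
# N-byte Hangul table in Section 3.3.6.

def johab_pack(initial: int, medial: int, final: int) -> int:
    for value in (initial, medial, final):
        if not 0 <= value < 32:
            raise ValueError("each element code must fit in five bits")
    # Bit 1 (the most significant bit) is always on.
    return 0x8000 | (initial << 10) | (medial << 5) | final

INITIAL_G = 0b00010    # initial "g"
MEDIAL_A = 0b00011     # medial "a"
FINAL_FILL = 0b00001   # final placeholder ("fill")

code = johab_pack(INITIAL_G, MEDIAL_A, FINAL_FILL)
print(hex(code))       # 0x8861

# Size checks for the figures quoted in the text.
total_hangul = 19 * 21 * 28
hangul_second_bytes = len(range(0x41, 0x7F)) + len(range(0x81, 0xFF))
symbol_second_bytes = len(range(0x31, 0x7F)) + len(range(0x91, 0xFF))
print(total_hangul, hangul_second_bytes, symbol_second_bytes)
```

Note how the medial field occupies bits 7 through 11, so it straddles the byte boundary; this is why the two bytes cannot be interpreted independently.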
+ Here is a summary of the above table, which better describes +what is encoded where. Rows 0x84 through 0xD3 provide 80 rows of 188 +characters each (15,040 code points, which is more than enough for the +11,172 pre-combined hangul). Row 0xD8 provides 188 user-defined +positions, the same as Rows 41 and 94 in the standard KS C 5601-1992 +table. Rows 0xD9 through 0xDE encode Rows 1 through 12 of the standard +KS C 5601-1992 table (symbols). Rows 0xE0 through 0xF9 encode Rows 42 +through 94 of the KS C 5601-1992 table (hanja). The following URL +provides a complete mapping table for the KS C 5601-1992 symbols and +hanja: + + ftp://ftp.ora.com/pub/examples/nutshell/ujip/map/non-hangul-codes.txt + +The following URLs provide similar information (they are the same +file), but only for the 11,172 pre-combined hangul: + + ftp://ftp.ora.com/pub/examples/nutshell/ujip/map/hangul-codes.txt + ftp://unicode.org/pub/MappingTables/EastAsiaMaps/hangul-codes.txt + + Of further interest may be that Microsoft designates Johab +encoding as its Code Page 1361. Microsoft is planning to support Johab +encoding for Korean Windows NT. + + +3.3.6: N-BYTE HANGUL + + In the days before full two-byte capable operating systems, +each of the 51 basic hangul elements was encoded using a single +(7-bit) byte. The encoding range spans 0x40 through 0x7C, but there +are several unassigned gaps. This is known as the "N-byte Hangul" +code, and is described in Annex 4 (page 35) of the KS C 5601-1992 +manual.
+ The following table illustrates these 51 one-byte code points +(the pronunciation or meaning of the hangul element is provided in +parentheses) and how they map to the three 5-bit arrays in Johab +encoding (expressed as binary patterns): + + Element Initial Medial Final + ^^^^^^^ ^^^^^^^ ^^^^^^ ^^^^^ + 0x40 ("fill") 00001 00010 00001 + 0x41 (g) 00010 ***** 00010 + 0x42 (gg) 00011 ***** 00011 + 0x43 (gs) ***** ***** 00100 + 0x44 (n) 00100 ***** 00101 + 0x45 (nj) ***** ***** 00110 + 0x46 (nh) ***** ***** 00111 + 0x47 (d) 00101 ***** 01000 + 0x48 (dd) 00110 ***** ***** + 0x49 (r) 00111 ***** 01001 + 0x4A (rg) ***** ***** 01010 + 0x4B (rm) ***** ***** 01011 + 0x4C (rb) ***** ***** 01100 + 0x4D (rs) ***** ***** 01101 + 0x4E (rt) ***** ***** 01110 + 0x4F (rp) ***** ***** 01111 + 0x50 (rh) ***** ***** 10000 + 0x51 (m) 01000 ***** 10001 + 0x52 (b) 01001 ***** 10011 + 0x53 (bb) 01010 ***** ***** + 0x54 (bs) ***** ***** 10100 + 0x55 (s) 01011 ***** 10101 + 0x56 (ss) 01100 ***** 10110 + 0x57 (ng) 01101 ***** 10111 + 0x58 (j) 01110 ***** 11000 + 0x59 (jj) 01111 ***** ***** + 0x5A (c) 10000 ***** 11001 + 0x5B (k) 10001 ***** 11010 + 0x5C (t) 10010 ***** 11011 + 0x5D (p) 10011 ***** 11100 + 0x5E (h) 10100 ***** 11101 + 0x5F UNASSIGNED + 0x60 UNASSIGNED + 0x61 UNASSIGNED + 0x62 (a) ***** 00011 ***** + 0x63 (ae) ***** 00100 ***** + 0x64 (ya) ***** 00101 ***** + 0x65 (yae) ***** 00110 ***** + 0x66 (eo) ***** 00111 ***** + 0x67 (e) ***** 01010 ***** + 0x68 UNASSIGNED + 0x69 UNASSIGNED + 0x6A (yeo) ***** 01011 ***** + 0x6B (ye) ***** 01100 ***** + 0x6C (o) ***** 01101 ***** + 0x6D (wa) ***** 01110 ***** + 0x6E (wae) ***** 01111 ***** + 0x6F (oe) ***** 10010 ***** + 0x70 UNASSIGNED + 0x71 UNASSIGNED + 0x72 (yo) ***** 10011 ***** + 0x73 (u) ***** 10100 ***** + 0x74 (weo) ***** 10101 ***** + 0x75 (we) ***** 10110 ***** + 0x76 (wi) ***** 10111 ***** + 0x77 (yu) ***** 11010 ***** + 0x78 UNASSIGNED + 0x79 UNASSIGNED + 0x7A (eu) ***** 11011 ***** + 0x7B (yi) ***** 11100 ***** + 0x7C (i) 
***** 11101 ***** + + There are utilities to convert N-byte Hangul code to other, +more widely-used, encoding methods. Pointers to these and other code +conversion utilities can be found in Section 4.7. + + +3.3.7: UCS-2 + + UCS-2 (Universal Character Set containing 2 bytes) encoding is +one way to encode ISO 10646-1:1993 text, and is considered identical +to Unicode encoding. Its encoding range, which is quite simple, is as +follows: + + ISO 10646-1:1993 Characters Encoding Range + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ^^^^^^^^^^^^^^ + first byte range 0x00-0xFF + second byte range 0x00-0xFF + +Yes, folks, the whole range of 65,536 possible code points are +available for encoding characters. The "signature" that indicates a +file using UCS-2 is as follows: + + 0xFEFF + + Escape sequences for UCS-2 have already been registered with +ISO, and are as follows: + + ISO 10646-1:1993 Escape Sequence Hexadecimal ISO Reg + ^^^^^^^^^^^^^^^^ ^^^^^^^^^^^^^^^ ^^^^^^^^^^^ ^^^^^^^ + UCS-2 Level 1 <ESC> % / @ 0x1B252F40 162 + UCS-2 Level 2 <ESC> % / C 0x1B252F43 174 + UCS-2 Level 3 <ESC> % / E 0x1B252F45 176 + +So what do these three levels mean? Level 3 means all characters in +ISO 10646-1:1993 with no restrictions (0x0000 through 0xFFFF). + Level 2 begins to restrict the character set by not including +the following characters or character ranges: + + 0x0300-0x0345 0x09D7 0x0BD7 0x11A8-0x11F9 + 0x0360-0x0361 0x0A3C 0x0C55-0x0C56 0x20D0-0x20E1 + 0x0483-0x0486 0x0A70-0x0A71 0x0CD5-0x0CD6 0x302A-0x302F + 0x093C 0x0ABC 0x0D57 0x3099-0x309A + 0x0953-0x0954 0x0B3C 0x1100-0x1159 0xFE20-0xFE23 + 0x09BC 0x0B56-0x0B57 0x115F-0x11A2 + +These are all combining characters, and represent 364 code points. 
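The figure of 364 code points can be reproduced by tallying the ranges in the Level 2 list above. The following Python sketch does just that; the names are hypothetical, and single code points are written as one-element ranges.

```python
# Sketch: tally the code points excluded at UCS-2 Level 2, using the
# ranges listed above. Names are hypothetical.

LEVEL2_EXCLUDED = [
    (0x0300, 0x0345), (0x0360, 0x0361), (0x0483, 0x0486),
    (0x093C, 0x093C), (0x0953, 0x0954), (0x09BC, 0x09BC),
    (0x09D7, 0x09D7), (0x0A3C, 0x0A3C), (0x0A70, 0x0A71),
    (0x0ABC, 0x0ABC), (0x0B3C, 0x0B3C), (0x0B56, 0x0B57),
    (0x0BD7, 0x0BD7), (0x0C55, 0x0C56), (0x0CD5, 0x0CD6),
    (0x0D57, 0x0D57), (0x1100, 0x1159), (0x115F, 0x11A2),
    (0x11A8, 0x11F9), (0x20D0, 0x20E1), (0x302A, 0x302F),
    (0x3099, 0x309A), (0xFE20, 0xFE23),
]

def is_excluded_at_level2(code_point: int) -> bool:
    """True if the code point falls in one of the listed ranges."""
    return any(lo <= code_point <= hi for lo, hi in LEVEL2_EXCLUDED)

excluded_count = sum(hi - lo + 1 for lo, hi in LEVEL2_EXCLUDED)
print(excluded_count)
```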
+ Level 1 further restricts the character set by not including +the following characters or character ranges: + + 0x05B0-0x05B9 0x09BE-0x09C4 0x0B47-0x0B48 0x0D02-0x0D03 + 0x05BB-0x05BD 0x09C7-0x09C8 0x0B4B-0x0B4D 0x0D3E-0x0D43 + 0x05BF 0x09CB-0x09CD 0x0B82-0x0B83 0x0D46-0x0D48 + 0x05C1-0x05C2 0x09E2-0x09E3 0x0BBE-0x0BC2 0x0D4A-0x0D4D + 0x064B-0x0652 0x0A02 0x0BC6-0x0BC8 0x0E31 + 0x0670 0x0A3E-0x0A42 0x0BCA-0x0BCD 0x0E34-0x0E3A + 0x06D6-0x06E4 0x0A47-0x0A48 0x0C01-0x0C03 0x0E47-0x0E4E + 0x06E7-0x06E8 0x0A4B-0x0A4D 0x0C3E-0x0C44 0x0EB1 + 0x06EA-0x06ED 0x0A81-0x0A83 0x0C46-0x0C48 0x0EB4-0x0EB9 + 0x0901-0x0903 0x0ABE-0x0AC5 0x0C4A-0x0C4D 0x0EBB-0x0EBC + 0x093E-0x094D 0x0AC7-0x0AC9 0x0C82-0x0C83 0x0EC8-0x0ECD + 0x0951-0x0952 0x0ACB-0x0ACD 0x0CBE-0x0CC4 0xFB1E + 0x0962-0x0963 0x0B01-0x0B03 0x0CC6-0x0CC8 + 0x0981-0x0983 0x0B3E-0x0B43 0x0CCA-0x0CCD + +These, too, are all combining characters, and represent 586 code +points (222 above plus the 364 characters from the Level 2 +restriction). + + +3.3.8: UCS-4 + + UCS-4 (Universal Character Set containing 4 bytes) encoding is +another way to encode ISO 10646-1:1993 text, and is used for future +expansion of the character set. Its encoding range is as follows: + + ISO 10646-1:1993 Characters Encoding Range + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ^^^^^^^^^^^^^^ + first byte range 0x00-0x7F + second byte range 0x00-0xFF + third byte range 0x00-0xFF + fourth byte range 0x00-0xFF + +Note that the first byte range only goes up to 0x7F. This means that +UCS-4 is a 31-bit encoding. And, in case you're wondering, 31 bits +provide 2,147,483,648 code points. 
The "signature" that indicates a +file using UCS-4 is as follows: + + 0x0000 0xFEFF + + Escape sequences for UCS-4 have already been registered with +ISO, and are as follows: + + ISO 10646-1:1993 Escape Sequence Hexadecimal ISO Reg + ^^^^^^^^^^^^^^^^ ^^^^^^^^^^^^^^^ ^^^^^^^^^^^ ^^^^^^^ + UCS-4 Level 1 <ESC> % / A 0x1B252F41 163 + UCS-4 Level 2 <ESC> % / D 0x1B252F44 175 + UCS-4 Level 3 <ESC> % / F 0x1B252F46 177 + +See the end of Section 3.3.7 for a description of these three levels. +But, in the case of UCS-4, simply prepend "0000" to all the values. + + +3.3.9: UTF-7 + + It turns out that *raw* ISO 10646-1:1993 encoding (that is, +UCS-2 or UCS-4) can cause problems because null bytes (0x00) are +possible (and frequent). Several UTFs (UCS Transformation Formats) +have been developed to deal with this and other problems. I must admit +that I don't know too much about UTFs, and what I provide below is +minimal, but does include pointers to more complete descriptions. + UTF-7 is a mail-safe 7-bit transformation format for UCS-2 +(including UTF-16). It uses straight ASCII for many ASCII characters, +and switches into a Base64 encoding of UCS-2 or UTF-16 for everything +else. It was designed to be usable in MIME-compliant e-mail headers as +well as message bodies, and to pass through gateways to non-ASCII mail +systems (like Bitnet). More detailed information on UTF-7 can be found +in RFC 1642, and a UTF-7 converter is available. The following URLs +provide this information: + + http://www.stonehand.com/unicode/standard/utf7.html + ftp://unicode.org/pub/Programs/ConvertUTF/ + + +3.3.10: UTF-8 + + UTF-8 (also known as UTF-2 or FSS-UTF -- FSS stands for "file +system safe") can represent any character in UCS-2 and UCS-4, and is +officially an annex to ISO 10646-1:1993. It is different from UTF-7 in +that it encodes character sets into 8-bit bytes. UCS-2 and UCS-4 have +problems with some file systems and utilities, so this UTF was +developed. 
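The null-byte problem with raw UCS-2, and how UTF-8 avoids it, can be seen in a minimal Python sketch. It assumes Python's built-in "utf-16-be" codec as a stand-in for big-endian UCS-2 (the two agree for the characters used here); the sample text is arbitrary.

```python
# Sketch: raw UCS-2 contains null bytes for ASCII text, UTF-8 does not.
# "utf-16-be" stands in for big-endian UCS-2 (an assumption that holds
# for the two characters used here).

text = "A\u6f22"                 # "A" plus one kanji (U+6F22)

ucs2 = text.encode("utf-16-be")
utf8 = text.encode("utf-8")

print(ucs2.hex())                # 00416f22
print(utf8.hex())                # 41e6bca2

has_null_ucs2 = 0x00 in ucs2     # True: "A" becomes 0x00 0x41
has_null_utf8 = 0x00 in utf8     # False: "A" stays 0x41
```

In UTF-8 the ASCII character keeps its one-byte value, and every byte of the kanji has its high bit set, so neither can be mistaken for a null byte or for another ASCII character.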
+ More detailed information on UTF-8 and its relationship with +ISO 10646-1:1993 can be found at the following URLs: + + http://www.stonehand.com/unicode/standard/utf8.html + ftp://unicode.org/pub/Programs/ConvertUTF/ + + X/Open Company Limited also published a document that +describes UTF-8 in detail (they call it FSS-UTF), and you can find +information about it at the following URL: + + http://www.xopen.co.uk/public/pubs/catalog/c501.htm + +The new programming language called Java supports Unicode through +UTF-8. More information on Java is at the following URL: + + http://www.javasoft.com/ + + +3.3.11: UTF-16 + + UTF-16 (formerly UCS-2E), like UTF-8, is now officially an +annex to ISO 10646-1:1993. From what I've read, UTF-16 transforms +UCS-4 into a 16-bit form. UTF-16 can then be further encoded in UTF-7 +or UTF-8 (but doing this is not according to the standard -- there is +little to gain by doing so). + More detailed information on UTF-16 and its relationship with +ISO 10646-1:1993 can be found at the following URLs: + + http://www.stonehand.com/unicode/standard/utf16.html + ftp://unicode.org/pub/Programs/ConvertUTF/ + + +3.3.12: ANSI Z39.64-1989 + + The encoding used for ANSI Z39.64-1989 (and CCCII) is three- +byte 7-bit ISO 2022, namely the following code space: + + Three-byte ANSI Z39.64-1989 Encoding Range + ^^^^^^^^^^^^^^^^^^^^^^^^^^^ ^^^^^^^^^^^^^^ + first byte range 0x21-0x7E + second byte range 0x21-0x7E + third byte range 0x21-0x7E + + +3.3.13: BASE64 + + Base64 encoding is mentioned here only because of its common +usage in e-mail headers, and relationship with MIME (Multi-purpose +Internet Mail Extensions). It is also a source of confusion. Base64 is +a method of encoding arbitrary bytes into the safest 64-character +ASCII subset, and is defined in RFC 1341 (which adapted it from RFC +1113). RFC 1341 was made obsolete by RFC 1521. RFC 1522 also provides +useful information, particularly for handling non-ASCII text, and +obsoletes RFC 1342. 
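As a quick illustration of Base64, the following Python sketch encodes six sample bytes (the EUC-encoded kanji example used later in this section) two ways: with the standard library's base64 module, and with a hand-written helper that splits each 24-bit group into four 6-bit indices. The helper name encode_group is hypothetical and handles only complete three-byte groups (no "=" padding).

```python
import base64

# Six sample bytes (three EUC-encoded kanji).
euc_bytes = bytes.fromhex("BEAE CED3 B7F5")

# Library version.
encoded = base64.b64encode(euc_bytes)
print(encoded)                   # b'vq7O07f1'

# Manual version of the same idea: 24 bits -> four 6-bit indices.
ALPHABET = (
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789+/"
)

def encode_group(three: bytes) -> str:
    """Encode exactly three bytes as four Base64 characters."""
    if len(three) != 3:
        raise ValueError("this sketch handles full three-byte groups only")
    n = int.from_bytes(three, "big")
    return "".join(ALPHABET[(n >> shift) & 0x3F] for shift in (18, 12, 6, 0))

manual = encode_group(euc_bytes[:3]) + encode_group(euc_bytes[3:])
print(manual)                    # vq7O07f1
```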
+ Here is how it works. Every three bytes are encoded as a +four-byte sequence. That is, the 24 bits that make up the three bytes +are split into four 6-bit segments (6 bits can encode up to 64 +characters). Each 6-bit segment is then converted into a character in +the Base64 Alphabet (see below). There is a 65th character, "=", which +has a special purpose (it functions as a "pad" if a full three-byte +sequence is not found). This all may sound a bit like uuencoding, but +it is different. The Base64 Alphabet is as follows: + + ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/ + + My name, written in Japanese kanji, is as follows when it is +EUC-encoded (six bytes, expressed as three groups of hexadecimal +values, one group for each character): + + 0xBEAE 0xCED3 0xB7F5 + +When these three EUC-encoded characters are converted to Base64 +encoding, they appear as follows (eight bytes): + + vq7O07f1 + + Base64 encoding is most commonly used for encoding non-ASCII +text that appears in e-mail headers. Of all the portions of an e-mail +message, its header gets manipulated the most during transmission, and +Base64 encoding offers a safe way to further encode non-ASCII text so +that it is not altered by mail-routing software. This is where Base64 +encoding can cause confusion. For example, what goes through your mind +when you see the following chunk o' text? + + From: lunde@adobe.com (=?ISO-2022-JP?B?vq7O07f1?=) + +Many folks think that they are seeing ISO-2022-JP encoding. Not +true. The "ISO-2022-JP" portion is just a flag that indicates the +original encoding before Base64 encoding was applied. The actual +Base64-encoded portion is enclosed between question marks (?) 
as +follows: + + From: lunde@adobe.com (=?ISO-2022-JP?B?vq7O07f1?=) + >^^^^^^^^< + +The whole string enclosed in parentheses has several components, and +the following explains their purpose and relationships (using the +above string as an example): + + Component Explanation + ^^^^^^^^^ ^^^^^^^^^^^ + =? Signals start of encoded string + ISO-2022-JP Charset name ("ISO-2022-JP" is for Japanese) + ? Delimiter + B Encoding ("B" is for Base64) + ? Delimiter + vq7O07f1 Example string of type "charset" encoded by "encoding" + ?= Signals end of encoded string + + One typically does not need to worry about encoding text as +Base64 (MIME-compliant mailing software usually performs this task for +you). The problem is usually trying to decode Base64-encoded text. A +Base64 decoder is available in Perl at the following URL: + + ftp://ftp.ora.com/pub/examples/nutshell/ujip/perl/b64decode.pl + +Note that this program takes "raw" Base64 data as input. Any non- +Base64 stuff must be stripped. I usually run this from within Mule +("C-u M-| b64decode.pl") after defining a region around the Base64- +encoded material. I hope to replace this program soon with one that +automatically recognizes the Base64-encoded portions. + Most MIME-compliant e-mail software can decode Base64-encoded +text. + + +3.3.14: IBM DBCS-HOST + + The oldest two-byte encoding system is IBM's DBCS-Host. DBCS +stands for Double-Byte Character Set. DBCS-Host is still in use on +IBM's mainframe computer systems (hence the use of "Host"). + DBCS-Host encoding is EBCDIC-based, and uses Shift characters, +0x0E and 0x0F, to switch between one- and two-byte mode. 
Its encoding +specifications are as follows: + + Two-byte Characters Encoding Range + ^^^^^^^^^^^^^^^^^^^ ^^^^^^^^^^^^^^ + first byte range 0x41-0xFE + second byte range 0x41-0xFE + + Two-byte "Space" Character Code Point + ^^^^^^^^^^^^^^^^^^^^^^^^^^ ^^^^^^^^^^ + first- and second byte 0x4040 + + One-byte Characters Encoding Range + ^^^^^^^^^^^^^^^^^^^ ^^^^^^^^^^^^^^ + EBCDIC 0x41-0xF9 + + Shifting Characters Code Point + ^^^^^^^^^^^^^^^^^^^ ^^^^^^^^^^^^^^ + Two-byte 0x0E + One-byte 0x0F + +This same encoding specification is shared by all of IBM's CJK +character sets, namely for Japanese, Simplified Chinese, Traditional +Chinese, and Korean. + + +3.3.15: IBM DBCS-PC + + IBM's DBCS-PC encoding is used on IBM personal computers (that +is where the "PC" comes from). DBCS-PC encoding is ASCII-based, and +uses the values of characters' bytes themselves to switch between one- +and two-byte mode. Its encoding specifications are as follows: + + Two-byte Characters Encoding Ranges + ^^^^^^^^^^^^^^^^^^^ ^^^^^^^^^^^^^^^ + first byte range 0x81-0xFE + second byte range 0x40-0x7E, 0x80-0xFE + + One-byte Characters Encoding Range + ^^^^^^^^^^^^^^^^^^^ ^^^^^^^^^^^^^^ + ASCII 0x21-0x7E + +This same encoding specification is shared by all of IBM's CJK +character sets, namely for Japanese, Simplified Chinese, Traditional +Chinese, and Korean. + DBCS-PC encoding for Japanese, although conforming to the +above encoding specifications, actually uses the same encoding +specifications for Shift-JIS, to include the full user-defined range +(see Section 3.3.1 for more details on Shift-JIS encoding). One big +accommodation is the half-width katakana range, namely 0xA1 through +0xDF. Further, the DBCS-PC code space that is outside the Shift-JIS +specification is unused. + DBCS-PC encoding for Korean uses the equivalent of EUC code +set 1 code points (0xA1A1 through 0xFEFE) for those characters that +are common with KS C 5601-1992. 
Those characters that are not common +with KS C 5601-1992, namely IBM's extensions, are within the DBCS-PC +encoding space, but outside EUC encoding space (0x9A through 0xA0). +Many hanja and pre-combined hangul are part of IBM's Korean extension. + Note that DBCS-PC is sort of useless without a corresponding +SBCS (Single-Byte Character Set) for the one-byte range. Mixing DBCS +and SBCS results in a MBCS (Multiple-Byte Character Set). How these +are mixed to form MBCSs is detailed in Section 3.4. + + +3.3.16: IBM DBCS-/TBCS-EUC + + IBM has also developed DBCS-EUC and TBCS-EUC encodings. TBCS +stands for Triple-Byte Character Set. These essentially follow the EUC +encoding specifications, and were developed for use with IBM's AIX +(Advanced Interactive Executive) operating system, which is +UNIX-based. + Refer to Section 3.2 for all the details on EUC encoding. + + +3.3.17: UNIFIED HANGUL CODE + + Microsoft has developed what is called "Unified Hangul Code" +(UHC) for its Windows 95 operating system (this was also known as +"Extended Wansung"). It is the optional, not standard, character set +of Win95K. + UHC provides full compatibility with KS C 5601-1992 EUC +encoding (see Section 3.2.4), but adds additional encoding ranges for +holding additional pre-combined hangul (more precisely, the 8,822 that +are needed to fully support the Johab character set). The following is +a table that provides the encoding ranges for UHC encoding: + + Two-byte Standard Characters Encoding Ranges + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ^^^^^^^^^^^^^^^ + first byte range 0x81-0xFE + second byte ranges 0x41-0x5A, 0x61-0x7A, + and 0x81-0xFE + + One-byte Characters Encoding Range + ^^^^^^^^^^^^^^^^^^^ ^^^^^^^^^^^^^^ + ASCII 0x21-0x7E + +Note that 0xA1A1 through 0xFEFE in the above encoding is still +identical, in terms of character-to-code allocation, with KS C 5601- +1992 in EUC encoding. 
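The compatibility claim for UHC can be spot-checked with Python's built-in "cp949" and "euc_kr" codecs (an assumption; these are Python's names, not names used in this document). The second sample character is one commonly cited as missing from the 2,350 pre-combined hangul of KS C 5601-1992, so the sketch only checks that its UHC bytes fall inside the ranges in the table above.

```python
# Sketch: compare UHC (Python codec "cp949") with KS C 5601-1992 EUC
# (Python codec "euc_kr"). Codec names are Python's, not the document's.

ga = "\uac00"      # hangul syllable present in KS C 5601-1992
ddom = "\ub620"    # hangul syllable commonly cited as absent from it

# Characters common to both encodings keep the same two bytes.
same_in_both = ga.encode("cp949") == ga.encode("euc_kr")

def in_uhc_ranges(two_bytes: bytes) -> bool:
    """Check a two-byte code against the UHC ranges in the table."""
    first, second = two_bytes
    return 0x81 <= first <= 0xFE and (
        0x41 <= second <= 0x5A
        or 0x61 <= second <= 0x7A
        or 0x81 <= second <= 0xFE
    )

uhc = ddom.encode("cp949")
print(same_in_both, uhc.hex(), in_uhc_ranges(uhc))
```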
+ Appendix G (pp 345-406) of "Developing International Software +for Windows 95 and Windows NT" by Nadine Kano illustrates the KS C +5601-1992 character set standard plus these Microsoft extensions +(8,822 pre-combined hangul) by UHC code (Microsoft calls this Code +Page 949). + + +3.3.18: TRON CODE + + TRON (The Real-time Operating system Nucleus) is an OS +developed in Japan some time ago. Personal Media Corporation has done +work to develop BTRON (Business TRON), which is unique in that it is +the only commercially-available OS that supports JIS X 0212-1990. + TRON Code provides a one- and two-byte encoding space and a +method for switching between them. + The following is how the two-byte space in TRON Code is +allocated: + + A-Zone (8,836 characters; JIS X 0208-1990) Encoding Range + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ^^^^^^^^^^^^^^ + first byte range 0x21-0x7E + second byte range 0x21-0x7E + + B-Zone (11,844 characters; JIS X 0212-1990) Encoding Range + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ^^^^^^^^^^^^^^ + first byte range 0x80-0xFD + second byte range 0x21-0x7E + + C-Zone (11,844 characters; unassigned) Encoding Range + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ^^^^^^^^^^^^^^ + first byte range 0x21-0x7E + second byte range 0x80-0xFD + + D-Zone (15,876 characters; unassigned) Encoding Range + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ^^^^^^^^^^^^^^ + first byte range 0x80-0xFD + second byte range 0x80-0xFD + +Note how the B-Zone is larger than the conventional 94-by-94 +matrix. In fact, the JIS X 0212-1990 portion of the B-Zone is +restricted to 0xA121-0xFD7E (93-by-94 matrix -- 0xFE as a first-byte +value is unavailable, and you will see why in a minute). + TRON Code implements "language specifying codes" consisting of +two bytes as follows: + + Two-byte Japanese 0xFE21 + One-byte English 0xFE80 + +0xFE21 in a one-byte stream invokes two-byte Japanese mode, and 0xFE80 +in a two-byte stream invokes one-byte English mode. 
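The zone sizes quoted above follow directly from the byte ranges, which a short sketch can confirm. The names LOW, HIGH, tron_zone, and ZONE_SIZES are hypothetical and are not part of any TRON specification; only the ranges themselves come from the tables above.

```python
LOW = range(0x21, 0x7F)    # 0x21-0x7E: 94 byte values
HIGH = range(0x80, 0xFE)   # 0x80-0xFD: 126 byte values


def tron_zone(first, second):
    """Return the two-byte TRON Code zone for a byte pair, or None if outside all zones."""
    if first in LOW and second in LOW:
        return "A"
    if first in HIGH and second in LOW:
        return "B"
    if first in LOW and second in HIGH:
        return "C"
    if first in HIGH and second in HIGH:
        return "D"
    return None


# Zone sizes are the products of the first- and second-byte range lengths.
ZONE_SIZES = {
    "A": len(LOW) * len(LOW),
    "B": len(HIGH) * len(LOW),
    "C": len(LOW) * len(HIGH),
    "D": len(HIGH) * len(HIGH),
}

# The JIS X 0212-1990 portion of the B-Zone uses first bytes 0xA1-0xFD only.
JIS_X_0212_ROWS = len(range(0xA1, 0xFE))
```

Because 0xFE is excluded from both byte ranges, the language specifying codes 0xFE21 and 0xFE80 can never be mistaken for a character in any zone.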
+ The following is the one-byte encoding range for TRON Code: + + One-byte Characters 0x21-0x7E and 0x80-0xFD + +Control codes are in 0x00-0x20 and 0x7F (the usual ASCII control code +range). Also, 0xA0 is reserved as a fixed-width space character. + + +3.3.19: GBK + + GBK is an extension to GB 2312-80 that adds all ISO 10646- +1:1993 (GB 13000.1-93) hanzi not already in GB 2312-80. GBK is defined +as a normative annex of GB 13000.1-93 (see Section 2.2.10). The "K" in +"GBK" is the first sound in the Chinese word meaning "extension" (read +"Kuo Zhan"). + GBK is divided into five levels as follows: + + Level Encoded Range Total Code Points Total Encoded Characters + ^^^^^ ^^^^^^^^^^^^^ ^^^^^^^^^^^^^^^^^ ^^^^^^^^^^^^^^^^^^^^^^^^ + GBK/1 0xA1A1-0xA9FE 846 717 + GBK/2 0xB0A1-0xF7FE 6,768 6,763 + GBK/3 0x8140-0xA0FE 6,080 6,080 + GBK/4 0xAA40-0xFEA0 8,160 8,160 + GBK/5 0xA840-0xA9A0 192 166 + + There are also 1,894 user-defined code points as follows: + + Encoded Range Total Code Points + ^^^^^^^^^^^^^ ^^^^^^^^^^^^^^^^^ + 0xAAA1-0xAFFE 564 + 0xF8A1-0xFEFE 658 + 0xA140-0xA7A0 672 + + GBK thus provides a total of 23,940 code points, 21,886 of +which are assigned. + Each "row" in the GBK code table consists of 190 characters. +The following describes the encoding ranges of GBK in detail: + + Two-byte Standard Characters Encoding Ranges + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ^^^^^^^^^^^^^^^ + first byte range 0x81-0xFE + second byte ranges 0x40-0x7E and 0x80-0xFE + + One-byte Characters Encoding Range + ^^^^^^^^^^^^^^^^^^^ ^^^^^^^^^^^^^^ + ASCII 0x21-0x7E + +Note that the sub-range 0xA1A1-0xFEFE in the above encoding is still +identical, in terms of character-to-code allocation, with GB 2312-80 +in EUC encoding. GBK is therefore backward-compatible with GB 2312-80 +and forward-compatible with ISO 10646-1:1993. + GBK is the standard character set and encoding for the +Simplified Chinese version of Windows 95. 
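The totals in the GBK tables above can be cross-checked by enumerating the two-byte space. The sketch below is an illustration only: gbk_region and count_regions are hypothetical names, and the region boundaries are taken directly from the level and user-defined tables in this section.

```python
from collections import Counter


def is_trail(b):
    """Second byte of a GBK character: 0x40-0x7E or 0x80-0xFE (190 values per row)."""
    return 0x40 <= b <= 0x7E or 0x80 <= b <= 0xFE


def gbk_region(lead, trail):
    """Return 'GBK/1'..'GBK/5' or 'user' for a byte pair, or None if not valid GBK."""
    if not (0x81 <= lead <= 0xFE and is_trail(trail)):
        return None
    if trail >= 0xA1:
        # Second byte in 0xA1-0xFE: the columns shared with GB 2312-80 in EUC.
        if 0xA1 <= lead <= 0xA9:
            return "GBK/1"
        if 0xB0 <= lead <= 0xF7:
            return "GBK/2"
        if 0xAA <= lead <= 0xAF or 0xF8 <= lead <= 0xFE:
            return "user"
        return "GBK/3"          # lead 0x81-0xA0
    # Second byte in 0x40-0x7E or 0x80-0xA0.
    if lead <= 0xA0:
        return "GBK/3"
    if lead <= 0xA7:
        return "user"
    if lead <= 0xA9:
        return "GBK/5"
    return "GBK/4"              # lead 0xAA-0xFE


def count_regions():
    """Count the code points that fall in each region of the GBK code space."""
    return Counter(
        gbk_region(lead, trail)
        for lead in range(0x81, 0xFF)
        for trail in range(0x40, 0xFF)
        if is_trail(trail)
    )
```

Summing the counts reproduces the 23,940 code points stated above, with 1,894 of them in the user-defined ranges.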
+ + +3.4: CJK CODE PAGES + + One often encounters references to "Code Pages" in +material about CJK (and other) character sets and encodings. These are +not literal pages, but rather references to a character set and +encoding combination. In the case of CJK Code Pages, they definitely +comprise more than one page! + Microsoft refers to its supported CJK character sets and +encodings through such Code Page designations. The following is a +listing of several Microsoft CJK Code Pages along with their +characteristics: + + Code Page Characteristics + ^^^^^^^^^ ^^^^^^^^^^^^^^^ + 932 JIS X 0208-1990 base, Shift-JIS encoding, Microsoft + extensions (NEC Row 13 and IBM select characters + redundantly encoded in Rows 89 through 92 and Rows 115 + through 119) + 936 GB 2312-80 base, EUC encoding + 949 KS C 5601-1992 base, Unified Hangul Code encoding, + remaining 8,822 pre-combined hangul as extension (all of + this is referred to as Unified Hangul Code) + 950 Big Five base, Big Five encoding, Microsoft extensions + (actually, the ETen extensions of Row 89) + 1361 Johab base, Johab encoding + + IBM also uses Code Page designations, and, in fact, some +designations (and associated characteristics) are nearly identical to +those in the above table, most notably, Code Pages 932 and 936. IBM's +Code Page 932 does not include NEC Row 13 or IBM select characters in +Rows 89 through 92. 
+ The best way to describe IBM Code Page designations is by +first listing the SBCS (Single-Byte Character Set) and DBCS (Double- +Byte Character Set) Code Page designations (those designated by "Host" +use EBCDIC-based encodings): + + IBM SBCS Code Page Characteristics + ^^^^^^^^^^^^^^^^^^ ^^^^^^^^^^^^^^^ + 37 (US) SBCS-Host + 290 (Japanese) SBCS-Host + 833 (Korean) SBCS-Host + 836 (Simplified Chinese) SBCS-Host + 891 (Korean) SBCS-PC + 897 (Japanese) SBCS-PC + 903 (Simplified Chinese) SBCS-PC + 904 (Traditional Chinese) SBCS-PC + + IBM DBCS Code Page Characteristics + ^^^^^^^^^^^^^^^^^^ ^^^^^^^^^^^^^^^ + 300 (Japanese) DBCS-Host + 301 (Japanese) DBCS-PC + 834 (Korean) DBCS-Host + 835 (Traditional Chinese) DBCS-Host + 837 (Simplified Chinese) DBCS-Host + 926 (Korean) DBCS-PC + 927 (Traditional Chinese) DBCS-PC + 928 (Simplified Chinese) DBCS-PC + +So far there appears to be no relationship with Microsoft's CJK Code +Pages, but when we combine the above SBCS and DBCS Code Pages into +MBCS (Multiple-Byte Character Set) Code Pages, things become a bit +more revealing: + + IBM MBCS Code Page Characteristics + ^^^^^^^^^^^^^^^^^^ ^^^^^^^^^^^^^^^ + 930 (Japanese) MBCS-Host (Code Pages 300 and 290) + 932 (Japanese) MBCS-PC (Code Pages 301 and 897) + 933 (Korean) MBCS-Host (Code Pages 834 and 833) + 934 (Korean) MBCS-PC (Code Pages 926 and 891) + 938 (Traditional Chinese) MBCS-PC (Code Pages 927 and 904) + 936 (Simplified Chinese) MBCS-PC (Code Pages 928 and 903) + 5031 (Simplified Chinese) MBCS-Host (Code Pages 837 and 836) + 5033 (Traditional Chinese) MBCS-Host (Code Pages 835 and 37) + +So, you can now see that many of Microsoft's CJK Code Pages are +derived from those established by IBM. + More detailed information on the encoding specifications for +DBCS-Host and DBCS-PC can be found in Sections 3.3.14 and 3.3.15, +respectively. 
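The way IBM's MBCS Code Pages are assembled from a DBCS and an SBCS Code Page can be written down as a small lookup table. The data below is copied from the tables in this section; the names IBM_MBCS, MICROSOFT_CJK, and components are hypothetical and exist only for this sketch.

```python
# MBCS Code Page -> (language, platform, DBCS Code Page, SBCS Code Page),
# as listed in the IBM MBCS table above.
IBM_MBCS = {
    930: ("Japanese", "Host", 300, 290),
    932: ("Japanese", "PC", 301, 897),
    933: ("Korean", "Host", 834, 833),
    934: ("Korean", "PC", 926, 891),
    938: ("Traditional Chinese", "PC", 927, 904),
    936: ("Simplified Chinese", "PC", 928, 903),
    5031: ("Simplified Chinese", "Host", 837, 836),
    5033: ("Traditional Chinese", "Host", 835, 37),
}

# Microsoft CJK Code Pages from the earlier table in this section.
MICROSOFT_CJK = {932, 936, 949, 950, 1361}


def components(code_page):
    """Return the (DBCS, SBCS) Code Pages that make up an IBM MBCS Code Page."""
    _language, _platform, dbcs, sbcs = IBM_MBCS[code_page]
    return dbcs, sbcs


# Designations that appear in both vendors' lists.
shared = sorted(set(IBM_MBCS) & MICROSOFT_CJK)
```

With the tables as given, the designations shared by both vendors are 932 and 936, and both are MBCS-PC Code Pages.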
+ + +PART 4: CJK CHARACTER SET COMPATIBILITY ISSUES + + The sections below provide detailed information about +compatibility issues between CJK character sets, to include tidbits of +useful information. + One thing to mention first is that conversion to and from +IBM's DBCS-Host (Section 3.3.14) and DBCS-PC (Section 3.3.15) +encodings is table-driven, and fully documented in the following IBM +publication: + +o IBM Corporation. "Character Data Representation Architecture - Level + 2, Registry." 1993. IBM order number SC09-1391-01. + +Unfortunately, the CJK-related tables are not supplied in machine- +readable format, and must be obtained from IBM directly. The only real +compatibility issue is trying to obtain the conversion tables from +IBM. + + +4.1: JAPANESE + + In general, when a Japanese character set was revised, +characters were simply added (usually appended at the end). However, +when JIS C 6226-1978 was revised in 1983 (to become JIS X 0208-1983), +a bit more happened (this is still a controversy). + A detailed treatment of the two main transitions, JIS C 6226- +1978 to JIS X 0208-1983 and JIS X 0208-1983 to JIS X 0208-1990, is +covered in Appendix J of UJIP. I provide machine-readable files that +detail these transitions at the following URL: + + ftp://ftp.ora.com/pub/examples/nutshell/ujip/AppJ/ + + An interesting side note here is that there is a reason why +there are many lists that illustrate JIS C 6226-1978 and JIS X 0208- +1983 kanji form differences. While most share the same basic set of +changes, there are some inconsistencies. Well, it turns out that JIS C +6226-1978 had ten printings, and not all of them shared the same kanji +forms. If comparisons between JIS C 6226-1978 and JIS X 0208-1983 were +made using different printings of the JIS C 6226-1978 manual, the +results can differ slightly. + There are also interesting correspondences between JIS X +0208-1990 and JIS X 0212-1990. 
28 kanji that vanished during the JIS C +6226-1978 to JIS X 0208-1983 transition (they were replaced by +simplified versions) were restored in JIS X 0212-1990 (at totally +different code points). Appendix J of UJIP discusses this, and a file +at the following URL details the 28 mappings: + + ftp://ftp.ora.com/pub/examples/nutshell/ujip/AppJ/TJ2.jis + + +4.2: CHINESE (PRC) + + The basic PRC standard, GB 2312-80, has been revised, but not +through a later version of the standard. Instead, the revisions were +carried out in the form of three other documents. Specifically, they +are (in order of publication): + +o GB 6345.1-86 (see Section 2.2.3) +o GB 8565.2-88 (see Section 2.2.6) +o GB/T 12345-90 (see Section 2.2.7) + +Unless you are aware of these documents, figuring out what has been +corrected or added to GB 2312-80 is nearly impossible. + + +4.3: CHINESE (TAIWAN) + + The first question people think of with regard to Big Five and +CNS 11643-1992 is compatibility. It turns out that Planes 1 and 2 of +CNS 11643-1992 are more or less equivalent to Big Five, but a handful +of hanzi are in a different order. 
The following tables detail the +mapping from Big Five (with the ETen extension) to CNS 11643-1992 +(when using this conversion table, keep in mind the encoding space +ranges for both Big Five and CNS 11643-1992): + +Big Five Level 1 Correspondence to CNS 11643-1992 Plane 1: + + 0xA140-0xA1F5 <-> 0x2121-0x2256 + 0xA1F6 <-> 0x2258 + 0xA1F7 <-> 0x2257 + 0xA1F8-0xA2AE <-> 0x2259-0x234E + 0xA2AF-0xA3BF <-> 0x2421-0x2570 + 0xA3C0-0xA3E0 <-> 0x4221-0x4241 # Symbols for control characters + 0xA440-0xACFD <-> 0x4421-0x5322 # Level 1 Hanzi BEGIN + 0xACFE <-> 0x5753 + 0xAD40-0xAFCF <-> 0x5323-0x5752 + 0xAFD0-0xBBC7 <-> 0x5754-0x6B4F + 0xBBC8-0xBE51 <-> 0x6B51-0x6F5B + 0xBE52 <-> 0x6B50 + 0xBE53-0xC1AA <-> 0x6F5C-0x7534 + 0xC1AB-0xC2CA <-> 0x7536-0x7736 + 0xC2CB <-> 0x7535 + 0xC2CC-0xC360 <-> 0x7737-0x782C + 0xC361-0xC3B8 <-> 0x782E-0x7863 + 0xC3B9 <-> 0x7865 + 0xC3BA <-> 0x7864 + 0xC3BB-0xC455 <-> 0x7866-0x7961 + 0xC456 <-> 0x782D + 0xC457-0xC67E <-> 0x7962-0x7D4B # Level 1 Hanzi END + 0xC6A1-0xC6AA <-> 0x2621-0x262A # Circled numerals + 0xC6AB-0xC6B4 <-> 0x262B-0x2634 # Parenthesized numerals + 0xC6B5-0xC6BE <-> 0x2635-0x263E # Lowercase Roman numerals + 0xC6BF-0xC6C0 <-> 0x2723-0x2724 # 213 radicals BEGIN + 0xC6C1-0xC6C2 <-> 0x2726, 0x2728 + 0xC6C3-0xC6C5 <-> 0x272D-0x272F + 0xC6C6-0xC6C7 <-> 0x2734, 0x2737 + 0xC6C8-0xC6C9 <-> 0x273A, 0x273C + 0xC6CA-0xC6CB <-> 0x2742, 0x2747 + 0xC6CC-0xC6CD <-> 0x274E, 0x2753 + 0xC6CE-0xC6CF <-> 0x2754-0x2755 + 0xC6D0-0xC6D1 <-> 0x2759-0x275A + 0xC6D2-0xC6D3 <-> 0x2761, 0x2766 + 0xC6D4-0xC6D5 <-> 0x2829-0x282A + 0xC6D6-0xC6D7 <-> 0x2863, 0x286C # 213 radicals END + 0xC6D8-0xC6E6 -> ****** # Japanese symbols + 0xC6E7-0xC77A -> ****** # Hiragana + 0xC77B-0xC7F2 -> ****** # Katakana + 0xC7F3-0xC875 -> ****** # Cyrillic alphabet + 0xC876-0xC878 -> ****** # Symbols + 0xC87A -> ****** # Hanzi element + 0xC87C -> ****** # Hanzi element + 0xC87E-0xC8A1 -> ****** # Hanzi elements + 0xC8A3-0xC8A4 -> ****** # Hanzi elements + 0xC8A5-0xC8CC -> ****** 
# Combined numerals + 0xC8CD-0xC8D3 -> ****** # Japanese symbols + +Big Five Level 1 Correspondences to CNS 11643-1992 Plane 4: + + 0xC879 <-> 0x2123 # Hanzi element + 0xC87B <-> 0x2124 # Hanzi element + 0xC87D <-> 0x212A # Hanzi element + 0xC8A2 <-> 0x2152 # Hanzi element + +Big Five Level 2 Correspondence to CNS 11643-1992 Plane 1: + + 0xC94A -> 0x4442 # duplicate of 0xA461 + +Big Five Level 2 Correspondences to CNS 11643-1992 Plane 2: + + 0xC940-0xC949 <-> 0x2121-0x212A # Level 2 Hanzi BEGIN + 0xC94B-0xC96B <-> 0x212B-0x214B + 0xC96C-0xC9BD <-> 0x214D-0x217C + 0xC9BE <-> 0x214C + 0xC9BF-0xC9EC <-> 0x217D-0x224C + 0xC9ED-0xCAF6 <-> 0x224E-0x2438 + 0xCAF7 <-> 0x224D + 0xCAF8-0xD6CB <-> 0x2439-0x376E + 0xD6CC <-> 0x3E63 + 0xD6CD-0xD779 <-> 0x3770-0x387D + 0xD77A <-> 0x3F6A + 0xD77B-0xDADE <-> 0x387E-0x3E62 + 0xDADF <-> 0x376F + 0xDAE0-0xDBA6 <-> 0x3E64-0x3F69 + 0xDBA7-0xDDFB <-> 0x3F6B-0x4423 + 0xDDFC -> 0x4176 # duplicate of 0xDCD1 + 0xDDFD-0xE8A2 <-> 0x4424-0x554A + 0xE8A3-0xE975 <-> 0x554C-0x5721 + 0xE976-0xEB5A <-> 0x5723-0x5A27 + 0xEB5B-0xEBF0 <-> 0x5A29-0x5B3E + 0xEBF1 <-> 0x554B + 0xEBF2-0xECDD <-> 0x5B3F-0x5C69 + 0xECDE <-> 0x5722 + 0xECDF-0xEDA9 <-> 0x5C6A-0x5D73 + 0xEDAA-0xEEEA <-> 0x5D75-0x6038 + 0xEEEB <-> 0x642F + 0xEEEC-0xF055 <-> 0x6039-0x6242 + 0xF056 <-> 0x5D74 + 0xF057-0xF0CA <-> 0x6243-0x6336 + 0xF0CB <-> 0x5A28 + 0xF0CC-0xF162 <-> 0x6337-0x642E + 0xF163-0xF16A <-> 0x6430-0x6437 + 0xF16B <-> 0x6761 + 0xF16C-0xF267 <-> 0x6438-0x6572 + 0xF268 <-> 0x6934 + 0xF269-0xF2C2 <-> 0x6573-0x664C + 0xF2C3-0xF374 <-> 0x664E-0x6760 + 0xF375-0xF465 <-> 0x6762-0x6933 + 0xF466-0xF4B4 <-> 0x6935-0x6961 + 0xF4B5 <-> 0x664D + 0xF4B6-0xF4FC <-> 0x6962-0x6A4A + 0xF4FD-0xF662 <-> 0x6A4C-0x6C51 + 0xF663 <-> 0x6A4B + 0xF664-0xF976 <-> 0x6C52-0x7165 + 0xF977-0xF9C3 <-> 0x7167-0x7233 + 0xF9C4 <-> 0x7166 + 0xF9C5 <-> 0x7234 + 0xF9C6 <-> 0x7240 + 0xF9C7-0xF9D1 <-> 0x7235-0x723F + 0xF9D2-0xF9D5 <-> 0x7241-0x7244 # Level 2 Hanzi END + 0xF9DD-0xF9FE -> ****** # Symbols + +Big 
Five Level 2 Correspondence to CNS 11643-1992 Plane 3: + + 0xF9D6 <-> 0x4337 # ETen-specific hanzi + 0xF9D7 <-> 0x4F50 # ETen-specific hanzi + 0xF9D8 <-> 0x444E # ETen-specific hanzi + 0xF9D9 <-> 0x504A # ETen-specific hanzi + 0xF9DA <-> 0x2C5D # ETen-specific hanzi + 0xF9DB <-> 0x3D7E # ETen-specific hanzi + 0xF9DC <-> 0x4B5C # ETen-specific hanzi + +I adapted the above from material Ross Paterson (rap@doc.ic.ac.uk) +kindly made available at the following URL: + + http://www.ifcss.org:8001/www/pub/software/info/cjk-codes/ + +Check it out. Basically, I just changed the CNS 11643-1992 codes from +decimal row-cell values to hexadecimal codes, and corrected the +mappings to correspond to ETen's Big Five (which is considered to be +the most standard). + It turns out that corrections were made to Big Five (at least +in the ETen and Microsoft implementations thereof) which made it a bit +closer to CNS 11643-1992 as far as character ordering is concerned. +The following seven lines of code correspondences: + + 0xCAF8-0xD6CB <-> 0x2439-0x376E + 0xD6CC <-> 0x3E63 + 0xD6CD-0xD779 <-> 0x3770-0x387D + 0xD77A <-> 0x3F6A + 0xD77B-0xDADE <-> 0x387E-0x3E62 + 0xDADF <-> 0x376F + 0xDAE0-0xDBA6 <-> 0x3E64-0x3F69 + +can now be expressed as the following three lines: + + 0xCAF8-0xD779 <-> 0x2439-0x387D + 0xD77A <-> 0x3F6A + 0xD77B-0xDBA6 <-> 0x387E-0x3F69 + +In essence, Big Five characters 0xD6CC and 0xDADF were swapped. +This resulted in the same order as found in CNS 11643-1992 +Plane 2. + As for the two duplicate hanzi in Big Five (as indicated in +the above tables), they have been placed into a compatibility zone in +ISO 10646-1:1993 (this allows for round-trip conversion). The mapping +is as follows: + + Big Five ISO 10646-1:1993 + ^^^^^^^^ ^^^^^^^^^^^^^^^^ + 0xC94A -> 0xFA0C + 0xDDFC -> 0xFA0D + + Speaking of duplicate hanzi, Plane 1 of CNS 11643-1992 +contains 213 classical radicals in rows 27 through 29. 
However, 187 of +them map directly to hanzi code points in Planes 1, 2, and 3 (and +naturally to Big Five). Below is a detailed mapping of these 213 +radicals: + + Radical CNS 11643 Big Five Radical CNS 11643 Big Five + ^^^^^^^ ^^^^^^^^^ ^^^^^^^^ ^^^^^^^ ^^^^^^^^^ ^^^^^^^^ + 0x2721 -> 0x4421 0xA440 0x282E -> 0x4678 0xA5D8 + 0x2722 -> 0x2121 (3) ****** 0x282F -> 0x4679 0xA5D9 + 0x2723 -> 0x2122 (3) 0xC6BF 0x2830 -> 0x467A 0xA5DA + 0x2724 -> 0x2123 (3) 0xC6C0 0x2831 -> 0x467B 0xA5DB + 0x2725 -> 0x4422 0xA441 0x2832 -> 0x467C 0xA5DC + 0x2726 -> 0x2124 (3) 0xC6C1 0x2833 -> 0x2167 (2) 0xC9A8 + 0x2727 -> 0x4428 0xA447 0x2834 -> 0x467D 0xA5DD + 0x2728 -> ****** 0xC6C2 0x2835 -> 0x467E 0xA5DE + 0x2729 -> 0x4429 0xA448 0x2836 -> 0x4721 0xA5DF + 0x272A -> 0x442A 0xA449 0x2837 -> 0x484C 0xA6CB + 0x272B -> 0x442B 0xA44A 0x2838 -> 0x484D 0xA6CC + 0x272C -> 0x442C 0xA44B 0x2839 -> 0x484E 0xA6CD + 0x272D -> 0x2127 (3) 0xC6C3 0x283A -> 0x484F 0xA6CE + 0x272E -> 0x2128 (3) 0xC6C4 0x283B -> 0x2269 (2) 0xCA49 + 0x272F -> ****** 0xC6C5 0x283C -> 0x4850 0xA6CF + 0x2730 -> 0x442D 0xA44C 0x283D -> 0x4851 0xA6D0 + 0x2731 -> 0x2123 (2) 0xC942 0x283E -> 0x4852 0xA6D1 + 0x2732 -> 0x442E 0xA44D 0x283F -> 0x4854 0xA6D3 + 0x2733 -> 0x4430 0xA44F 0x2840 -> 0x4855 0xA6D4 + 0x2734 -> ****** 0xC6C6 0x2841 -> 0x4856 0xA6D5 + 0x2735 -> 0x4431 0xA450 0x2842 -> 0x4857 0xA6D6 + 0x2736 -> 0x2124 (2) 0xC943 0x2843 -> 0x4858 0xA6D7 + 0x2737 -> 0x2129 (3) 0xC6C7 0x2844 -> 0x485B 0xA6DA + 0x2738 -> 0x4432 0xA451 0x2845 -> 0x485C 0xA6DB + 0x2739 -> 0x4433 0xA452 0x2846 -> 0x485D 0xA6DC + 0x273A -> 0x212A (3) 0xC6C8 0x2847 -> 0x485E 0xA6DD + 0x273B -> 0x2125 (2) 0xC944 0x2848 -> 0x485F 0xA6DE + 0x273C -> 0x212B (3) 0xC6C9 0x2849 -> 0x4860 0xA6DF + 0x273D -> 0x4434 0xA453 0x284A -> 0x4861 0xA6E0 + 0x273E -> 0x4447 0xA466 0x284B -> 0x4862 0xA6E1 + 0x273F -> 0x212A (2) 0xC949 0x284C -> 0x4863 0xA6E2 + 0x2740 -> 0x4448 0xA467 0x284D -> 0x226A (2) 0xCA4A + 0x2741 -> 0x4449 0xA468 0x284E -> 0x226F (2) 0xCA4F + 
0x2742 -> 0x213A (3) 0xC6CA 0x284F -> 0x4865 0xA6E4 + 0x2743 -> 0x444A 0xA469 0x2850 -> 0x4866 0xA6E5 + 0x2744 -> 0x444B 0xA46A 0x2851 -> 0x4867 0xA6E6 + 0x2745 -> 0x444C 0xA46B 0x2852 -> 0x4868 0xA6E7 + 0x2746 -> 0x444D 0xA46C 0x2853 -> 0x2270 (2) 0xCA50 + 0x2747 -> 0x213B (3) 0xC6CB 0x2854 -> 0x4B44 0xA8A3 + 0x2748 -> 0x4450 0xA46F 0x2855 -> 0x4B45 0xA8A4 + 0x2749 -> 0x4451 0xA470 0x2856 -> 0x4B46 0xA8A5 + 0x274A -> 0x4452 0xA471 0x2857 -> 0x4B47 0xA8A6 + 0x274B -> 0x4453 0xA472 0x2858 -> 0x4B48 0xA8A7 + 0x274C -> 0x212B (2) 0xC94B 0x2859 -> 0x4B49 0xA8A8 + 0x274D -> 0x4454 0xA473 0x285A -> 0x2524 (2) 0xCBA4 + 0x274E -> 0x213C (3) 0xC6CC 0x285B -> 0x4B4A 0xA8A9 + 0x274F -> 0x4456 0xA475 0x285C -> 0x4B4B 0xA8AA + 0x2750 -> 0x4457 0xA476 0x285D -> 0x4B4C 0xA8AB + 0x2751 -> 0x445A 0xA479 0x285E -> 0x4B4D 0xA8AC + 0x2752 -> 0x445B 0xA47A 0x285F -> 0x4B4E 0xA8AD + 0x2753 -> 0x213D (3) 0xC6CD 0x2860 -> 0x4B4F 0xA8AE + 0x2754 -> 0x213E (3) 0xC6CE 0x2861 -> 0x4B50 0xA8AF + 0x2755 -> 0x213F (3) 0xC6CF 0x2862 -> 0x4B51 0xA8B0 + 0x2756 -> 0x445C 0xA47B 0x2863 -> 0x272F (3) 0xC6D6 + 0x2757 -> 0x445D 0xA47C 0x2864 -> 0x4B57 0xA8B6 + 0x2758 -> 0x445E 0xA47D 0x2865 -> 0x4B5C 0xA8BB + 0x2759 -> 0x2140 (3) 0xC6D0 0x2866 -> 0x4B5D 0xA8BC + 0x275A -> 0x2142 (3) 0xC6D1 0x2867 -> 0x4B5E 0xA8BD + 0x275B -> 0x212C (2) 0xC94C 0x2868 -> 0x4F5A 0xAAF7 + 0x275C -> 0x4540 0xA4DF 0x2869 -> 0x4F5B 0xAAF8 + 0x275D -> 0x4541 0xA4E0 0x286A -> 0x4F5C 0xAAF9 + 0x275E -> 0x4542 0xA4E1 0x286B -> 0x4F5D 0xAAFA + 0x275F -> 0x4543 0xA4E2 0x286C -> 0x2A7D (3) 0xC6D7 + 0x2760 -> 0x4545 0xA4E4 0x286D -> 0x4F63 0xAB41 + 0x2761 -> 0x2167 (3) 0xC6D2 0x286E -> 0x4F64 0xAB42 + 0x2762 -> 0x4546 0xA4E5 0x286F -> 0x4F65 0xAB43 + 0x2763 -> 0x4547 0xA4E6 0x2870 -> 0x4F66 0xAB44 + 0x2764 -> 0x4548 0xA4E7 0x2871 -> 0x5372 0xADB1 + 0x2765 -> 0x4549 0xA4E8 0x2872 -> 0x5373 0xADB2 + 0x2766 -> 0x2169 (3) 0xC6D3 0x2873 -> 0x5374 0xADB3 + 0x2767 -> 0x454A 0xA4E9 0x2874 -> 0x5375 0xADB4 + 0x2768 -> 0x454B 0xA4EA 0x2875 -> 
0x5376 0xADB5 + 0x2769 -> 0x454C 0xA4EB 0x2876 -> 0x5377 0xADB6 + 0x276A -> 0x454D 0xA4EC 0x2877 -> 0x5378 0xADB7 + 0x276B -> 0x454E 0xA4ED 0x2878 -> 0x5379 0xADB8 + 0x276C -> 0x454F 0xA4EE 0x2879 -> 0x537A 0xADB9 + 0x276D -> 0x4550 0xA4EF 0x287A -> 0x537B 0xADBA + 0x276E -> 0x213F (2) 0xC95F 0x287B -> 0x537C 0xADBB + 0x276F -> 0x4551 0xA4F0 0x287C -> 0x586B 0xB0A8 + 0x2770 -> 0x4552 0xA4F1 0x287D -> 0x586C 0xB0A9 + 0x2771 -> 0x4553 0xA4F2 0x287E -> 0x586D 0xB0AA + 0x2772 -> 0x4554 0xA4F3 0x2921 -> 0x334C (2) 0xD449 + 0x2773 -> 0x2141 (2) 0xC961 0x2922 -> 0x586E 0xB0AB + 0x2774 -> 0x4555 0xA4F4 0x2923 -> 0x334D (2) 0xD44A + 0x2775 -> 0x4556 0xA4F5 0x2924 -> 0x586F 0xB0AC + 0x2776 -> 0x4557 0xA4F6 0x2925 -> 0x5870 0xB0AD + 0x2777 -> 0x4558 0xA4F7 0x2926 -> 0x5E23 0xB3BD + 0x2778 -> 0x4559 0xA4F8 0x2927 -> 0x5E24 0xB3BE + 0x2779 -> 0x2142 (2) 0xC962 0x2928 -> 0x5E25 0xB3BF + 0x277A -> 0x455A 0xA4F9 0x2929 -> 0x5E26 0xB3C0 + 0x277B -> 0x455B 0xA4FA 0x292A -> 0x5E27 0xB3C1 + 0x277C -> 0x455C 0xA4FB 0x292B -> 0x5E28 0xB3C2 + 0x277D -> 0x455D 0xA4FC 0x292C -> 0x6327 0xB6C0 + 0x277E -> 0x4668 0xA5C8 0x292D -> 0x6328 0xB6C1 + 0x2821 -> 0x4669 0xA5C9 0x292E -> 0x6329 0xB6C2 + 0x2822 -> 0x466A 0xA5CA 0x292F -> 0x4155 (2) 0xDCB0 + 0x2823 -> 0x466B 0xA5CB 0x2930 -> 0x4875 (2) 0xE0EF + 0x2824 -> 0x466C 0xA5CC 0x2931 -> 0x676F 0xB9A9 + 0x2825 -> 0x466D 0xA5CD 0x2932 -> 0x6770 0xB9AA + 0x2826 -> 0x466E 0xA5CE 0x2933 -> 0x6771 0xB9AB + 0x2827 -> 0x4670 0xA5D0 0x2934 -> 0x6B7C 0xBBF3 + 0x2828 -> 0x4674 0xA5D4 0x2935 -> 0x6B7D 0xBBF4 + 0x2829 -> 0x225B (3) 0xC6D4 0x2936 -> 0x702F 0xBEA6 + 0x282A -> 0x225C (3) 0xC6D5 0x2937 -> 0x733E 0xC073 + 0x282B -> 0x4675 0xA5D5 0x2938 -> 0x733F 0xC074 + 0x282C -> 0x4676 0xA5D6 0x2939 -> 0x6142 (2) 0xEFB6 + 0x282D -> 0x4677 0xA5D7 + + +4.4: KOREAN + + The 268 duplicate hanja in KS C 5601-1992 can cause problems +when converting to and from other CJK character sets. 
When converting +from KS C 5601-1992, two or more hanja can collapse into a single code +point. When converting these 268 hanja to KS C 5601-1992, a decision +about which KS C 5601-1992 code point to map to must be made. The only +exception to this is mapping to and from ISO 10646-1:1993. That +standard encodes these 268 duplicate hanja in a compatibility zone, +namely from 0xF900 through 0xFA0B. + The following is a listing of 262 hanja that map to two or +more code points (four map to three code points, and one maps to four: +a total of 268 redundantly-encoded hanja) in KS C 5601-1992: + + Standard Extra Standard Extra Standard Extra + ^^^^^^^^ ^^^^^ ^^^^^^^^ ^^^^^ ^^^^^^^^ ^^^^^ + 0x4A39 -> 0x4D4F 0x5573 -> 0x6631 0x573C -> 0x6B29 + 0x4B3D -> 0x7A22 0x5574 -> 0x6633 0x573E -> 0x6B3A + 0x4C38 -> 0x7A66 0x5575 -> 0x6637 0x573F -> 0x6B3B + 0x4C5A -> 0x4B56 0x5576 -> 0x6638 0x5740 -> 0x6B3D + 0x4C78 -> 0x5050 0x5579 -> 0x663C 0x5741 -> 0x6B41 + 0x4D7A -> 0x4E2D 0x557B -> 0x6646 0x5743 -> 0x6B42 + 0x4E29 -> 0x7C29 0x557C -> 0x6647 0x5744 -> 0x6B46 + 0x4F23 -> 0x4F7B 0x557E -> 0x6652 0x5745 -> 0x6B47 + 0x4F4F -> 0x5022 0x5621 -> 0x6656 0x5747 -> 0x6B4C + 0x5038 0x5622 -> 0x6659 0x5748 -> 0x6B4F + 0x5142 -> 0x4B50 0x5623 -> 0x665F 0x5749 -> 0x6B50 + 0x5151 -> 0x505D 0x5624 -> 0x6661 0x574A -> 0x6B51 + 0x5159 -> 0x547C 0x5625 -> 0x6665 0x574C -> 0x6B58 + 0x5167 -> 0x552B 0x5626 -> 0x6664 0x574D -> 0x5270 + 0x522F -> 0x5155 0x5627 -> 0x6666 0x574E -> 0x5271 + 0x5233 -> 0x657C 0x5628 -> 0x6668 0x574F -> 0x5272 + 0x5234 -> 0x6644 0x562A -> 0x666A 0x5750 -> 0x5273 + 0x5235 -> 0x664A 0x562B -> 0x666B 0x5752 -> 0x5274 + 0x5236 -> 0x665C 0x562D -> 0x666F 0x5753 -> 0x5275 + 0x5237 -> 0x6676 0x562E -> 0x6671 0x5754 -> 0x5277 + 0x523A -> 0x6677 0x562F -> 0x6675 0x5755 -> 0x5278 + 0x523B -> 0x5638 0x5631 -> 0x6679 0x5757 -> 0x6C26 + 0x672C 0x5633 -> 0x6721 0x5759 -> 0x6C27 + 0x5241 -> 0x564D 0x5634 -> 0x6726 0x575B -> 0x6C2A + 0x5263 -> 0x6871 0x5635 -> 0x6729 0x575D -> 0x6C30 + 
0x526E -> 0x6A74 0x5637 -> 0x672A 0x575E -> 0x6C31 + 0x526F -> 0x6B2A 0x563A -> 0x672D 0x5762 -> 0x6C35 + 0x527A -> 0x6C32 0x563B -> 0x6730 0x5765 -> 0x6C38 + 0x527B -> 0x6C49 0x563C -> 0x673F 0x5767 -> 0x6C3A + 0x527C -> 0x6C4A 0x563E -> 0x6746 0x576A -> 0x6C40 + 0x527E -> 0x7331 0x5640 -> 0x6747 0x576B -> 0x6C41 + 0x5321 -> 0x552E 0x5642 -> 0x674B 0x576C -> 0x6C45 + 0x5358 -> 0x7738 0x5643 -> 0x674D 0x576E -> 0x6C46 + 0x536B -> 0x7748 0x5644 -> 0x674F 0x5770 -> 0x6C55 + 0x5378 -> 0x7674 0x5645 -> 0x6750 0x5772 -> 0x6C5D + 0x5441 -> 0x5466 0x5647 -> 0x6753 0x5773 -> 0x6C5E + 0x5457 -> 0x7753 0x5649 -> 0x675F 0x5774 -> 0x6C61 + 0x547A -> 0x5154 0x564A -> 0x6764 0x5776 -> 0x6C64 + 0x547B -> 0x5158 0x564B -> 0x6766 0x5777 -> 0x6C67 + 0x547D -> 0x515B 0x564C -> 0x523E 0x5778 -> 0x6C68 + 0x547E -> 0x515C 0x564F -> 0x5242 0x5779 -> 0x6C77 + 0x5521 -> 0x515D 0x5650 -> 0x5243 0x577A -> 0x6C78 + 0x5522 -> 0x515E 0x5653 -> 0x5244 0x577C -> 0x6C7A + 0x5523 -> 0x515F 0x5654 -> 0x5246 0x5821 -> 0x6D21 + 0x5524 -> 0x5160 0x5655 -> 0x5247 0x5822 -> 0x6D22 + 0x5526 -> 0x5163 0x5656 -> 0x5248 0x5823 -> 0x6D23 + 0x5527 -> 0x5164 0x5657 -> 0x5249 0x5A72 -> 0x5B64 + 0x5528 -> 0x5165 0x5658 -> 0x524A 0x5C56 -> 0x5D25 + 0x552A -> 0x5166 0x565A -> 0x524B 0x5C5F -> 0x7870 + 0x552C -> 0x5168 0x565B -> 0x524D 0x5C74 -> 0x5D55 + 0x552D -> 0x5169 0x565C -> 0x524E 0x5D41 -> 0x5B45 + 0x552F -> 0x516A 0x565E -> 0x524F 0x5F2F -> 0x616D + 0x5530 -> 0x516B 0x565F -> 0x5250 0x5F52 -> 0x6D6E + 0x5531 -> 0x516D 0x5660 -> 0x5251 0x5F5D -> 0x5F61 + 0x5534 -> 0x516F 0x5661 -> 0x5252 0x5F63 -> 0x5E7E + 0x5535 -> 0x5170 0x5662 -> 0x5253 0x6063 -> 0x612D + 0x5536 -> 0x5172 0x5663 -> 0x5254 0x6672 + 0x5539 -> 0x5176 0x5665 -> 0x5255 0x607D -> 0x5F68 + 0x553D -> 0x517A 0x5666 -> 0x5256 0x6163 -> 0x574B + 0x5540 -> 0x517C 0x5667 -> 0x5257 0x6B52 + 0x5541 -> 0x517D 0x566B -> 0x5259 0x6226 -> 0x5E7C + 0x5543 -> 0x517E 0x566C -> 0x525A 0x6326 -> 0x6429 + 0x5544 -> 0x5222 0x566F -> 0x525E 0x635B -> 0x723D + 
0x5545 -> 0x5223 0x5670 -> 0x525F 0x6427 -> 0x727A + 0x5546 -> 0x5227 0x5671 -> 0x5261 0x6442 -> 0x6777 + 0x5547 -> 0x5228 0x5674 -> 0x5262 0x6445 -> 0x5162 + 0x5548 -> 0x5229 0x5675 -> 0x6867 0x5525 + 0x5549 -> 0x522A 0x5676 -> 0x6868 0x6879 + 0x554D -> 0x522B 0x5677 -> 0x6870 0x6534 -> 0x652E + 0x554E -> 0x522D 0x5679 -> 0x6877 0x6636 -> 0x6C2F + 0x5552 -> 0x5232 0x567A -> 0x687B 0x6728 -> 0x6071 + 0x5553 -> 0x6531 0x567B -> 0x687E 0x6856 -> 0x6A41 + 0x5554 -> 0x6532 0x567E -> 0x6927 0x6C36 -> 0x5764 + 0x5555 -> 0x6539 0x5721 -> 0x692C 0x6C56 -> 0x666C + 0x5557 -> 0x653B 0x5723 -> 0x694C 0x6D29 -> 0x7427 + 0x5558 -> 0x653C 0x5724 -> 0x5264 0x6D33 -> 0x6E5B + 0x5559 -> 0x6544 0x5726 -> 0x5265 0x6F37 -> 0x746E + 0x555D -> 0x654E 0x5727 -> 0x5266 0x7263 -> 0x6375 + 0x555E -> 0x6550 0x5728 -> 0x5267 0x7333 -> 0x4B67 + 0x555F -> 0x6552 0x5729 -> 0x5268 0x7351 -> 0x5F33 + 0x5561 -> 0x6556 0x572B -> 0x5269 0x742C -> 0x7676 + 0x5564 -> 0x657A 0x572C -> 0x526A 0x7658 -> 0x6421 + 0x5565 -> 0x657B 0x5730 -> 0x526B 0x7835 -> 0x5C25 + 0x5566 -> 0x657E 0x5731 -> 0x6A65 0x786C -> 0x785B + 0x5569 -> 0x6621 0x5733 -> 0x6A77 0x7932 -> 0x5D74 + 0x556B -> 0x6624 0x5735 -> 0x6A7C 0x7A3C -> 0x7A21 + 0x556C -> 0x6627 0x5736 -> 0x6A7E 0x7B29 -> 0x6741 + 0x556F -> 0x662D 0x5738 -> 0x6B24 0x7C41 -> 0x4D68 + 0x5571 -> 0x662F 0x573A -> 0x6B27 0x7D3B -> 0x6977 + 0x5572 -> 0x6630 + +The above table represents a weekend of my time (but time well spent, +in my opinion). + + +4.5: ISO 10646-1:1993 + + The Chinese character subset of ISO 10646-1:1993 +has excellent round-trip conversion capability with the various +national character sets. Those national character sets with duplicate +characters, such as KS C 5601-1992 (268 hanja) and Big Five (2 hanzi), +have corresponding code points in ISO 10646-1:1993 within +a compatibility zone. See Sections 4.3 and 4.4 for more details. 
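Why a compatibility zone preserves round-trip conversion can be shown with the two Big Five duplicates from Section 4.3. This is a minimal sketch, assuming a converter may either fold each duplicate onto its primary code or give it its own compatibility code point; all names (BIG5_DUPLICATE_OF, BIG5_TO_COMPAT, fold_duplicate, round_trips) are hypothetical, and only the code values come from this document.

```python
# Big Five duplicate -> the Big Five code it duplicates (Section 4.3 tables).
BIG5_DUPLICATE_OF = {0xC94A: 0xA461, 0xDDFC: 0xDCD1}

# Big Five duplicate -> its ISO 10646-1:1993 compatibility code point (Section 4.3).
BIG5_TO_COMPAT = {0xC94A: 0xFA0C, 0xDDFC: 0xFA0D}
COMPAT_TO_BIG5 = {ucs: big5 for big5, ucs in BIG5_TO_COMPAT.items()}

# The 268 duplicate hanja of KS C 5601-1992 occupy 0xF900 through 0xFA0B (Section 4.4).
KSC_COMPAT_ZONE = range(0xF900, 0xFA0B + 1)


def fold_duplicate(big5_code):
    """Lossy approach: map a duplicate onto the code it duplicates."""
    return BIG5_DUPLICATE_OF.get(big5_code, big5_code)


def round_trips(big5_code):
    """Lossless approach: go through the compatibility zone and back."""
    ucs = BIG5_TO_COMPAT[big5_code]
    return COMPAT_TO_BIG5[ucs] == big5_code
```

Folding sends 0xC94A and 0xA461 to the same target, so the original code cannot be recovered, whereas the compatibility code points keep the duplicates distinct. The KS C 5601-1992 range 0xF900 through 0xFA0B holds exactly 268 code points, one per duplicate hanja.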
+ Other issues regarding ISO 10646-1:1993 have to do with proper +character rendering (that is, how characters are displayed, printed, +or otherwise imaged). Many (sometimes) subtle character form +differences have been collapsed under ISO 10646-1:1993. Language or +locale was not one of the factors used in performing Han Unification. +This means that it is nearly impossible to create a single ISO 10646-1: +1993 font that meets the character form criteria of each of the four +CJK locales. An ISO 10646-1:1993 code point is not enough information +to render a Chinese character. If the font was specifically designed +for a single locale, it is a non-problem, but if there is any CJK +intent, text must be flagged for language or locale. + + +4.6: UNICODE + + One of the most interesting (and major) differences between +the current three flavors of Unicode is the number and arrangement of +pre-combined hangul. The following table provides a summary of the +differences: + + Unicode Number of Pre-combined Hangul UCS-2 Ranges + ^^^^^^^ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ^^^^^^^^^^^^ + Version 1.0 2,350 Basic Hangul 0x3400-0x3D2D + + Version 1.1 2,350 Basic Hangul 0x3400-0x3D2D + 1,930 Supplemental Hangul A 0x3D2E-0x44B7 + 2,376 Supplemental Hangul B 0x44B8-0x4DFF + + Version 2.0 11,172 Hangul 0xAC00-0xD7A3 + +Of the above three versions, the most controversial is Version 2.0. +Why? Because its hangul block is located in the user-defined range of +Unicode (O-Zone: 16,384 code points in 0xA000-0xDFFF), and occupies +approximately two-thirds of that space. + The information in the above table is courtesy of the +following useful document: + + ftp://unicode.org/pub/MappingTables/EastAsiaMaps/Hangul-Codes.txt + +The same file is also mirrored at the following URL: + + ftp://ftp.ora.com/pub/examples/nutshell/ujip/map/hangul-codes.txt + + +4.7: CODE CONVERSION TIPS + + There are two types of conversions that can be performed. 
The +first type is converting between different encodings for the same +character set. This usually works without problems (but not always). The +second type is converting from one character set to another (it is not +usually relevant whether the underlying encoding has changed or not). +This usually involves the handling of characters that are in one +character set, but not the other. So, what to do? + I suggest JConv for handling Japanese code conversion (this +means converting between JIS, Shift-JIS, and EUC encodings). This is +in the category of different encodings for the same character set. The +following URLs provide executables or source code: + + ftp://ftp.ora.com/pub/examples/nutshell/ujip/mac/jconv-30.hqx + ftp://ftp.ora.com/pub/examples/nutshell/ujip/mac/jconv-dd-181.hqx + ftp://ftp.ora.com/pub/examples/nutshell/ujip/dos/jconv.exe + ftp://ftp.ora.com/pub/examples/nutshell/ujip/src/jconv.c + +There are other programs available that do the same basic thing as +JConv, such as kc and nkf. They are available at the following URL: + + ftp://ftp.ora.com/pub/examples/nutshell/ujip/unix/ + + For software and tables that handle Chinese code conversion +(this includes conversion to and from Japanese), I suggest browsing at +the following URLs: + + ftp://etlport.etl.go.jp/pub/iso-2022-cn/convert/ + ftp://ftp.ifcss.org/pub/software/dos/convert/ + ftp://ftp.ifcss.org/pub/software/mac/convert/ + ftp://ftp.ifcss.org/pub/software/ms-win/convert/ + ftp://ftp.ifcss.org/pub/software/unix/convert/ + ftp://ftp.ifcss.org/pub/software/vms/convert/ + ftp://ftp.net.tsinghua.edu.cn/pub/Chinese/convert/ + ftp://ftp.ora.com/pub/examples/nutshell/ujip/map/ + ftp://ftp.seed.net.tw/Pub/Chinese/DOS/code-convert/ + http://www.yajima.kuis.kyoto-u.ac.jp/staffs/yasuoka/CJK.html + +The last URL has FTP links to tables created by Koichi Yasuoka +(yasuoka@kudpc.kyoto-u.ac.jp). 
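The two types of conversion described at the start of this section can be illustrated with a short sketch. It does not use JConv or any of the tools listed above; instead it assumes Python's built-in codecs (euc_jp, shift_jis, iso2022_jp, euc_kr) and pivots through Python's internal string type. The helper name reencode is hypothetical.

```python
def reencode(data, src, dst, errors="strict"):
    """Convert bytes from one encoding to another by decoding, then encoding."""
    return data.decode(src).encode(dst, errors)


# Type 1: same character set, different encodings. Nothing is lost.
japanese = "日本語"
euc = japanese.encode("euc_jp")
sjis = reencode(euc, "euc_jp", "shift_jis")
jis = reencode(euc, "euc_jp", "iso2022_jp")

# Type 2: different character sets. Hangul has no counterpart in the
# Japanese character sets, so a strict conversion fails outright.
korean = "한글".encode("euc_kr")
try:
    reencode(korean, "euc_kr", "euc_jp")
    conversion_failed = False
except UnicodeEncodeError:
    conversion_failed = True

# A lossy policy must be chosen explicitly, here substituting a replacement.
lossy = reencode(korean, "euc_kr", "euc_jp", errors="replace")
```

The design point is that the first type of conversion is mechanical, while the second forces a decision (fail, substitute, or map to a near equivalent) for every character missing from the target character set.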
+ The following URLs provide utilities or tables for converting +between various Korean encodings (the last two are the same file): + + ftp://cair-archive.kaist.ac.kr/pub/hangul/code/ + ftp://ftp.ora.com/pub/examples/nutshell/ujip/map/non-hangul-codes.txt + ftp://ftp.ora.com/pub/examples/nutshell/ujip/map/hangul-codes.txt + ftp://unicode.org/pub/MappingTables/EastAsiaMaps/Hangul-Codes.txt + +A popular Korean code conversion utility seems to be "hcode" by +June-Yub Lee (jylee@cims.nyu.edu). + Finally, the following URLs provide many Unicode- and CJK- +related mapping tables: + + ftp://ftp.ora.com/pub/examples/nutshell/ujip/map/ + ftp://ftp.ora.com/pub/examples/nutshell/ujip/unicode/ + ftp://unicode.org/pub/MappingTables/ + http://www.yajima.kuis.kyoto-u.ac.jp/staffs/yasuoka/CJK.html + +Note that the official and authoritative Unicode mapping tables (from +Unicode values to various international, national and vendor +standards) are maintained by the Unicode Consortium at the following +URL: + + ftp://unicode.org/pub/MappingTables/ + +Version 2.0 of "The Unicode Standard" (to be published by Addison- +Wesley shortly) will include these mapping tables on CD-ROM. + + +PART 5: CJK-CAPABLE OPERATING SYSTEMS + + The first step in being able to display CJK text is to obtain +an operating system that handles such text (or an application that +sets up its own CJK-capable environment). Below I describe how +different types of machines can handle CJK text. + Actually, for the first few releases of CJK.INF, these +subsections will be far from complete (some may even be empty!). The +purpose of CJK.INF is to provide detailed information on character set +standards and encoding systems, so I therefore consider this sort of +information secondary. + + +5.1: MS-DOS + + I am not aware of any CJK-capable MS-DOS operating system, but +localized versions do exist. CJK support has been introduced with +Microsoft's Windows operating system (see Section 5.2). 
+ + +5.2: WINDOWS + + Microsoft has CJK versions of its Windows operating system +available. The latest versions of their Windows operating system are +called Windows 95 and Windows NT. Windows 95 supports the same +character sets and encodings as in Windows Version 3.1 -- Windows NT +supports Unicode (ISO 10646-1:1993). Contact Microsoft Corporation for +more details. The URL of their WWW Home Page is: + + http://www.microsoft.com/ + +Nadine Kano's "Developing International Software for Windows 95 and +Windows NT" provides abundant reference material for how CJK is +supported in Windows 95 and Windows NT. Check it out. + TwinBridge is a package that adds CJK functionality to non-CJK +Windows. Demo versions of TwinBridge for Japanese and Chinese are at +the following URLs: + + ftp://ftp.netcom.com/pub/tw/twinbrg/Japanese/demo/tbjdemo.zip + ftp://ftp.netcom.com/pub/tw/twinbrg/Chinese/demo/tbcdemo.zip + + Another useful CJK add-on for Windows 95 is NJWIN (see Section +7.10) by Hongbo Data Systems. + + +5.3: MACINTOSH + + Macintosh is well-known as a computer that was designed to +handle multilingual texts. There are currently fully-localized +operating systems available for Japanese (KanjiTalk), Chinese +(simplified and traditional available), and Korean (HangulTalk). In +addition, Apple has developed "Language Kits" (*LK) for Chinese (CLK) +and Japanese (JLK). A Korean Language Kit (KLK) will be released +shortly. + These localized operating systems can usually be installed +together in order to make your system CJK-capable. + The common portion of these CJK-capable operating systems is a +technology Apple calls "WorldScript II" ("WorldScript I" is for one- +byte scripts). It provides the basic one- and two-byte functionality. + + +5.4: UNIX AND X WINDOWS + + The typical encoding system used on UNIX and X Windows is EUC +(see Section 3.2). Many systems, such as IBM's AIX, can be configured +to handle both EUC and Shift-JIS (for Japanese). 
In addition, X11R6 (X +Window System, Version 11, Release 6) has many CJK-capable features. + If you have a fast PC and a good amount of RAM (more than +4MB), you should consider replacing MS-DOS (and Microsoft Windows, +too, if you have it) with Linux, which is a full-blown UNIX operating +system that runs on Intel processors. You can even run X Windows +(X11R6). "Running Linux" by Matt Welsh and Lar Kaufman is an excellent +guide to installing and using Linux. The companion volume, "Linux +Network Administrator's Guide" by Olaf Kirch is also useful. Because +there is a fine line -- or no line at all -- between a user and System +Administrator when using Linux, "Essential System Administration" +Second Edition by AEleen Frisch is a must-have. + Linux and Linux information are available at the following +URLs: + + ftp://sunsite.unc.edu/pub/Linux/ + http://sunsite.unc.edu/mdw/linux.html + +I personally use Linux, and find it quite useful and powerful. My bias +comes from being a UNIX user. But, you can't beat the price (free), +and all of my favorite text-manipulation tools (such as Perl) are +readily available. + + +5.5: OTHERS + + No information yet. + + +PART 6: CJK TEXT AND INTERNET SERVICES + + Part 5 described how CJK text is handled on a machine +internally, but this part goes into the implications of handling such +text externally, namely for information interchange purposes. This +boils down to handling CJK text on Internet services. + For more detailed information on how these and other Internet +services are used, I suggest "The Whole Internet User's Guide & +Catalog" by Ed Krol. For more information on setting up and +maintaining these and other Internet services, I suggest "Managing +Internet Information Services" by Cricket Liu et al. + + +6.1: ELECTRONIC MAIL + + The most basic Internet service is electronic mail (henceforth +to be called "e-mail"), which is virtually guaranteed to be available +to all users regardless of their system. 
+ Several Internet standards (called RFCs, short for Request For +Comments) have been developed to describe how CJK text is to be handled +over e-mail systems (see Section A.3.4). + The bottom line is that most e-mail systems do not support +8-bit characters (that is, bytes that have their 8th bit set). Some do +offer 8-bit support, but you can never know what path your e-mail +might take while en route to its recipient. This means that 7-bit ISO +2022 (or equivalent) is the ideal encoding to use when sending CJK +text through e-mail. If your operating system processes another +encoding system, you must convert from that encoding to one that is +compatible with 7-bit ISO 2022. + However, even 7-bit ISO 2022 encoding can get mangled by +mail-routing software -- the escape character, sometimes even part of +the escape sequence (meaning more than just the escape character), is +stripped. The JConv tool described in Section 4.7 restores stripped +escape sequences for Japanese 7-bit ISO 2022. + If your mailing software is MIME-compliant, there is a means +to identify the character set and encoding of the message using the +"charset" parameter. Some valid "charset" values include the +following: + +o iso-2022-jp (see Section 3.1.3) +o iso-2022-jp-2 (see Section 3.1.3) +o iso-2022-kr (see Section 3.1.4) +o iso-2022-cn (see Section 3.1.5) +o iso-2022-cn-ext (see Section 3.1.5) +o iso-8859-1 + +Insertion of these values should happen automatically. + A last-ditch effort to send CJK text through e-mail is to use +uuencode or Base64 encoding (see Section 3.3.13). Base64 is something +that is usually done automatically by mailing software -- explicit +Base64 encoding is not common. The recipient must then run uudecode or +a Base64 decoder to get the original file (if such utilities are +available). + + +6.2: USENET NEWS + + Usenet News follows many of the same requirements as e-mail, +namely that 7-bit ISO 2022 encoding is ideal.
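The 7-bit requirement shared by e-mail and Usenet News can be checked directly. The sketch below is illustrative only: it assumes Python 3 with its standard iso2022_jp and euc_jp codecs and the base64 module, and the helper name is_seven_bit is hypothetical. It shows that ISO-2022-JP output never sets the 8th bit, what a message looks like once the escape characters have been stripped, and how Base64 wraps 8-bit data as a last resort.

```python
import base64

text = "日本語"

jis = text.encode("iso2022_jp")  # 7-bit ISO 2022 (ISO-2022-JP)
euc = text.encode("euc_jp")      # 8-bit EUC

def is_seven_bit(data: bytes) -> bool:
    """True if no byte in the data has its 8th bit set."""
    return all(byte < 0x80 for byte in data)

jis_is_7bit = is_seven_bit(jis)
euc_is_7bit = is_seven_bit(euc)

# Simulate mail-routing software that strips the escape character.
# Without ESC the decoder never leaves ASCII mode, so the kanji
# bytes show up as ordinary punctuation and letters.
stripped = jis.replace(b"\x1b", b"")
mangled = stripped.decode("iso2022_jp")

# Last resort: wrap the 8-bit data in Base64, then unwrap it.
wrapped = base64.b64encode(euc)
restored = base64.b64decode(wrapped)

print(jis_is_7bit, euc_is_7bit, mangled != text, restored == euc)
```

Repairing stripped escape sequences, as JConv does, is harder than detecting them and is not attempted here.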
However, some newsgroups +use specific encoding methods, such as: + + alt.chinese.text (HZ encoding used for Chinese text) + alt.chinese.text.big5 (Big Five encoding used for Chinese text) + chinese.flame (UTF-7) + chinese.text.unicode (UTF-8) + +Also, the newsgroups in Korean (all begin with "han.*") use EUC (EUC- +KR) because the news-handling software in Korea has been designed to +handle eight-bit characters correctly. Mailing list versions of Korean +newsgroups are likely to use ISO-2022-KR encoding. + One common problem with Usenet News is that the escape +characters used in 7-bit ISO 2022 encoding are sometimes stripped, +usually by the software used to post the article. This can be quite +annoying. There are programs available, such as JConv, that repair +such files by restoring the escape characters. + Another common problem is news readers that do not allow +escape characters to function. One simple solution is to "pipe" the +article through a display command, such as "more," "page," "less," or +"cat." This is done by typing a "pipe" character (|) followed by the +command name anywhere within the article being displayed. + + +6.3: GOPHER + + The World-Wide Web (WWW) has almost eliminated the need for +using Gopher, so I won't discuss it here. Not that I don't appreciate +Gopher servers, but what I mean is that WWW browsing software permits +access to Gopher sites. + + +6.4: WORLD-WIDE WEB + + First, there are two types of WWW browsers available. The most +common type is the graphics-based browser (examples include Mosaic and +Netscape). Graphics-based browsers have the unfortunate requirement of +a TCP/IP (SLIP and PPP support these protocols) connection. Lynx and +the W3 client for Emacs, which are text-based browsers, can be run +from the host computer through a standard terminal connection.
They +don't display all the pretty pictures that folks put into their WWW +documents, but you get all the text (this is, in many ways, a blessing +in disguise -- transferring graphics is what slows down graphics-based +browsers the most). When the W3 client is run using Mule, it becomes a +fully CJK-capable WWW browser. Both Lynx and the W3 client for Emacs +are freely available. A Japanese-capable Lynx is available at the +following URL: + + ftp://ftp.ipc.chiba-u.ac.jp/pub.asada/www/lynx/ + +There is also a WWW page that provides information on Japanese-capable +Lynx. Its URL is as follows: + + http://www.icsd6.tj.chiba-u.ac.jp/lynx/ + + When WWW documents first came online, there was no method for +handling CJK character sets. This has, fortunately, changed. As of +this writing, two commercial WWW browsers support Japanese. They are +Infomosaic by Fujitsu Limited, and Netscape Navigator by Netscape +Communications Corporation (Version 1.1 added Japanese support). Both +are graphics-based browsers. The former can be ordered at the +following URL: + + http://www.fujitsu.co.jp/ + +The latter can be found at the following URLs: + + http://www.netscape.com/ + ftp://ftp.netscape.com/ + + One can also use a delegate server to *filter* Japanese codes +to the one supported by your browser. It is also possible to +"Japanize" existing WWW browsers using assorted tools and patches. +Katsuhiko Momoi (momoi@tigger.stcloud.msus.edu) has authored an +excellent guide to Japanizing WWW browsers. Its URL is: + + http://condor.stcloud.msus.edu:20020/netscape.html + +I *highly* suggest reading it. + Japanese-capable WWW browsers support automatic detection of +the three Japanese encoding methods (JIS, Shift-JIS, and EUC). Hey, +but, what about support for the "C" and "K" of CJK? Attempting to +answer this question provides us an answer to another question: "What +is the best encoding method to use for CJK WWW documents?" 
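The automatic detection of JIS, Shift-JIS, and EUC mentioned above can be approximated with a simple heuristic. The sketch below is an assumption-laden illustration, not how any particular browser works: it relies on Python 3 and its standard codecs, and the function name guess_japanese_encoding is hypothetical.

```python
def guess_japanese_encoding(data: bytes) -> str:
    """Guess among JIS (ISO-2022-JP), EUC, and Shift-JIS for Japanese text."""
    # JIS announces itself with escape sequences such as ESC $ B or ESC ( B.
    if b"\x1b$" in data or b"\x1b(" in data:
        return "iso2022_jp"
    # Pure 7-bit data with no escape sequences is plain ASCII.
    if all(byte < 0x80 for byte in data):
        return "ascii"
    # Otherwise try a strict decode under each 8-bit candidate in turn.
    for candidate in ("euc_jp", "shift_jis"):
        try:
            data.decode(candidate)
        except UnicodeDecodeError:
            continue
        return candidate
    return "unknown"

for encoding in ("iso2022_jp", "euc_jp", "shift_jis"):
    print(encoding, guess_japanese_encoding("日本語".encode(encoding)))
```

The result is only a guess. Some byte sequences are valid in both EUC and Shift-JIS (half-width katakana in Shift-JIS is the usual troublemaker), and the candidate order above simply prefers EUC in that case.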
+ Encoding methods such as EUC and Shift-JIS provide for mixing +only two character sets. This is because they provide no way to *flag* +or *tag* text for locale (character set) information. Without flagging +information, it is impossible to distinguish Japanese EUC from Chinese +or Korean EUC. However, the escape sequences used in 7-bit ISO 2022 +encoding explicitly provide locale information. 7-bit ISO 2022 is +ideal for static documents, which is exactly what one finds on WWW. + My personal recommendation (for the short-term) is to compose +WWW documents (also called HTML documents; HTML stands for Hyper Text +Markup Language) using 7-bit ISO 2022 encoding. The escape sequences +themselves act as explicit flags that indicate locale. However, some +WWW clients are confused by 7-bit ISO 2022 encoding, but the products +by Netscape Communications and Fujitsu Limited prove that this can +work. See the following URL for a description of this problem: + + http://www.ntt.jp/japan/note-on-JP/LibWWW-patch.html + + Check out the following URLs for information on and proposals +for international support for WWW: + + http://www.ebt.com:8080/docs/multilingual-www.html + http://www.w3.org/hypertext/WWW/International/Overview/ + + There is currently an RFC in the works (called an Internet +Draft) to address the problem of internationalizing HTML by using +Unicode. It is very promising. The latest draft is available at the +following URLs: + + ftp://ds.internic.net/internet-drafts/draft-ietf-html-i18n-04.txt.Z + ftp://ftp.isi.edu/internet-drafts/draft-ietf-html-i18n-04.txt + ftp://munnari.oz.au/internet-drafts/draft-ietf-html-i18n-04.txt.Z + ftp://nic.nordu.net/internet-drafts/draft-ietf-html-i18n-04.txt + +Note that some have been compressed. + + +6.5: FILE TRANSFER TIPS + + Although CJK encoding systems such as Shift-JIS and EUC make +extensive use of 8-bit bytes, that does not mean that you need to +treat the data as binary. 
Such files are simply to be treated as text, +and should be transferred in text mode (for example, FTP's ASCII mode, +which is also called "Type A Transfer"). + When text files are transferred in binary mode (such as FTP's +BINARY mode, which is also called "Type I Transfer"), line termination +characters are left unaltered. For example, when transferring a text +file from UNIX to Macintosh, a text transfer will translate the UNIX +newline (0x0A) characters to Macintosh carriage return (0x0D) +characters, but a binary transfer will make no such modifications. +Text-style conversion is typically desired. + The most common types of files that need to be handled as +binary include tar archives (*.tar), compressed files (*.Z, *.gz, +*.zip, *.zoo, *.lzh, and so on), and executables (*.exe, *.bin, and so +on). + + +PART 7: CJK TEXT HANDLING SOFTWARE + + This section describes various CJK-capable software packages. +I expect this section to grow with future versions of this document. I +define "CJK-capable" as being able to support Chinese, Japanese, and +Korean text. + The descriptions I provide below are intentionally short. You +are encouraged to use the information pointers to obtain further +information or the software itself. + + +7.1: MULE + + Mule (multilingual enhancement to GNU Emacs), written by +Kenichi Handa (handa@etl.go.jp), is the first (and only?) CJK-capable +editor for UNIX systems, and is freely available under the terms of +the GNU General Public License. Mule was developed from Nemacs +(Nihongo Emacs). + Mule is available at the following URL: + + ftp://etlport.etl.go.jp/pub/mule/ + + Mule, beginning with Version 2.2, includes handy utilities +(any2ps and m2ps) for printing files in any of the encodings supported +by Mule (which is a lot of encodings, by the way). These programs use +BDF fonts. See the beginning of Part 2 for a list of URLs that have +CJK BDF fonts.
+ GNU Emacs is a fine editor, and Mule takes it several steps +further by providing multilingual support. I personally use Mule +together with SKK (for Japanese input) -- it is a superb combination. + + +7.2: CNPRINT + + CNPRINT, developed by Yidao Cai (cai@neurophys.wisc.edu), is a +utility to print CJK text (or convert it to a PostScript file), and is +available for MS-DOS, VMS, and UNIX systems. A wide range of encoding +methods are supported by CNPRINT. + CNPRINT is available at the following URLs: + + ftp://ftp.ifcss.org/pub/software/{dos,unix,vms}/print/ + ftp://neurophys.wisc.edu/[public.cn]/ + + +7.3: MASS + + MASS (Multilingual Application Support Service), developed at +the National University of Singapore, is a suite of software tools +that speed and ease the development of UNIX-based CJK (actually, more +than just CJK) applications. It supports a wide variety of character +sets and encodings, including ISO 10646-1:1993 (UCS-2, UTF-7, and +UTF-8), EACC, and CCCII. + More information on MASS, to include contact information for +its developers, can be found at the following URL: + + http://www.iss.nus.sg/RND/MLP/Projects/MASS/MASS.html + + +7.4: ADOBE TYPE MANAGER (ATM) + + Adobe Type Manager for Macintosh, beginning with Version 3.8, +is CJK-capable (as long as the underlying operating system is CJK- +capable). Actually, ATM generically supports CID-keyed fonts, which +are based on a newly-developed file specification for fonts with large +numbers of characters (like CJK fonts). See Section 7.9 for more +details. + ATM is very easy to obtain. It is bundled with fonts and +applications from Adobe Systems (chances are you have ATM if you +recently purchased an Adobe product). But what about Windows? The +Windows version of ATM should soon follow with identical +functionality. + + +7.5: MACINTOSH SOFTWARE + + WorldScript II, a System Extension introduced with System 7, +provides multi-byte script handling, namely CJK support. 
If a +Macintosh product claims to support WorldScript II, chances are it is +CJK-capable (provided that your operating system has the necessary +extensions loaded). + The CJK encodings that are supported by WorldScript II capable +applications are the same as made available by the underlying +Macintosh operating system. No import/export of other encodings is +supported at the operating system level. You must run separate +conversion utilities for both import and export. Anyway, below are +some products that are known to be CJK capable. + Nisus Writer, written by Nisus Software, is fully CJK-capable +as long as you have the appropriate scripts installed (such as CLK for +Chinese or JLK for Japanese). A "Language Key" (read "dongle") is also +required for Chinese and Korean (and some one-byte scripts such as +Arabic and Hebrew). A demo version of Nisus Writer is available at the +following URL: + + ftp://ftp.nisus-soft.com/pub/nisus/demos/ + +Give it a try! Updates are also available at the same FTP site. Nisus +Software can be contacted using the following e-mail address or +through their WWW page: + + info@nisus-soft.com + http://www.nisus-soft.com/ + +I also suggest reading "The Nisus Way" by Joe Kissell. Chapter 13 +provides detailed information about using Nisus Writer with +WorldScript, and includes a CD-ROM containing among other things a +trial (expires after 90 days) version of Nisus Writer and a +non-expiring version of Nisus Compact. + ClarisWorks by Claris Corporation, beginning with Version 4.0, +is compatible with WorldScript II and all Apple language kits. This +translates into full CJK support. The following URL provides a trial +version of ClarisWorks: + + ftp://ftp.claris.com/pub/USA-Macintosh/Trial_Software/ + +The following URL has detailed information on this and other Claris +products: + + http://www.claris.com/ + + The latest version of WordPerfect by Novell Incorporated is +also compatible with WorldScript II. 
The following URL has detailed +information: + + http://wp.novell.com/tree.htm + + +7.6: MACBLUE TELNET + + Although MacBlue Telnet (a modified version of NCSA Telnet) is +Macintosh software, I describe it separately because it does not +require the various Apple Language Kits or localized operating +systems. There are also input methods, adapted from cxterm (see +Section 7.7), available that cover the CJK spectrum (Japanese, +Simplified Chinese, Traditional Chinese, and Korean). + MacBlue Telnet is available at the following URL: + + ftp://ftp.ifcss.org/pub/software/mac/networking/MacBlueTelnet/ + +Its associated CJK input methods are at the following URL: + + ftp://ftp.ifcss.org/pub/software/mac/input/ + + +7.7: CXTERM + + This program, cxterm, is a CJK-capable xterm for X Windows +(works with X11R4, X11R5, and X11R6). It is based on the X11R6 xterm. +It is available at the following URL: + + ftp://ftp.ifcss.org/pub/software/x-win/cxterm/ + + The following URL is for a program that adds Unicode +capability to cxterm: + + ftp://ftp.ifcss.org/pub/software/unix/convert/hztty-2.0.tar.gz + +The following URL adds support for other encodings to cxterm: + + ftp://ftp.ifcss.org/pub/software/unix/convert/BeTTY-1.534.tar.gz + + +7.8: UW-DBM + + UW-DBM, for Windows 3.1, Windows 95, and Windows NT, is a +program that allows users to handle Chinese (Big Five, GB-2312-80, or +HZ code), Japanese (Shift-JIS), and Korean (KS C 5601-1992) +simultaneously. More information on UW-DBM is available at the +following URL: + + http://www.gy.com/ccd/win95/cjkw95.htm + + A demo version of UW-DBM is available at the following URL: + + ftp://ftp.aimnet.com/pub/users/chinabus/uwdbm40.zip + + +7.9: POSTSCRIPT + + With the introduction of CID-keyed Font Technology, PostScript +has become fully CJK capable. 
+ Adobe Systems has developed the following CJK character +collections for CID-keyed fonts (font developers are encouraged to +conform to these specifications): + + Character Collection CIDs Supported Character Sets & Encodings + ^^^^^^^^^^^^^^^^^^^^ ^^^^ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + Adobe-GB1-1 9,897 GB 2312-80 and GB/T 12345-90; 7-bit ISO + 2022 and EUC + Adobe-CNS1-0 14,099 Big Five (ETen extensions) and CNS + 11643-1992 Planes 1 and 2; Big Five, + 7-bit ISO 2022, and EUC + Adobe-Japan1-2 8,720 JIS X 0208-1990; Shift-JIS, 7-bit ISO + 2022, and EUC + Adobe-Japan2-0 6,068 JIS X 0212-1990; 7-bit ISO 2022 and EUC + Adobe-Korea1-1 18,155 KS C 5601-1992 (Macintosh extensions + plus Johab); 7-bit ISO 2022, EUC, UHC, + and Johab + +Note that Macintosh and Windows do not support any of the encodings +for Adobe-Japan2-0, thus fonts based on that specification are +unusable for those platforms. + Adobe Systems also has a few things in the works (that is, +they are either proposed or in draft form), all of which are +supplements to the above character collections (that is, they add CIDs): + + Character Collection CIDs Supported Character Sets & Encodings + ^^^^^^^^^^^^^^^^^^^^ ^^^^ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + Adobe-CNS1-1 +6,018 Add CNS 11643-1992 Plane 3 support (30 + of the 6,148 hanzi are in Adobe-CNS1-0) + + To find out more about these CJK character collections or +CID-keyed font technology, contact the Adobe Developers Association. +Several CID-related documents have been published. ADA's contact +information is as follows: + + Adobe Developers Association + Adobe Systems Incorporated + 1585 Charleston Road + P.O. Box 7900 + Mountain View, CA 94039-7900 + USA + +1-415-961-4111 (phone) + +1-415-967-9231 (facsimile) + devsupp-person@adobe.com + http://www.adobe.com/Support/ + +Adobe Systems has recently developed the CID SDK (CID Software +Developers Kit), which is on a single CD-ROM.
Contact the Adobe +Developers Association for information on obtaining a copy. + The complete CID-keyed font file specification and an overview +document are available at the following URLs (as a PostScript or PDF +[Adobe Acrobat] file, respectively): + + ftp://ftp.adobe.com/pub/adobe/DeveloperSupport/TechNotes/PSfiles/ + ftp://ftp.adobe.com/pub/adobe/DeveloperSupport/TechNotes/PDFfiles/ + +The file names (not provided above due to URL length) are: + + 5014.CMap_CIDFont_Spec.ps (complete CID engineering specification) + 5014.CMap_CIDFont_Spec.pdf + 5092.CID_Overview.ps (CID technology overview) + 5092.CID_Overview.pdf + +Other related files, mostly character collection specifications, are +available only in PDF format at the latter URL indicated above: + + 5004.AFM_Spec.pdf (Includes CID-keyed AFM specification) + 5078b.pdf (Adobe-Japan1-2 character collection) + 5079b.pdf (Adobe-GB1-0 character collection) + 5080b.pdf (Adobe-CNS1-0 character collection) + 5093b.pdf (Adobe-Korea1-0 character collection) + 5094.pdf (Adobe CJK CMap file descriptions) + 5097b.pdf (Adobe-Japan2-0 character collection) + +If you do not have Adobe Acrobat, there is a freely-available Acrobat +Reader (for Macintosh, Windows, MS-DOS, and UNIX) at the following +URL: + + ftp://ftp.adobe.com/pub/adobe/Applications/Acrobat/ + + I have also placed some CJK character collection materials, +including prototype Unicode (UCS-2 and UTF-8) CMap files, at the +following URL: + + ftp://ftp.ora.com/pub/examples/nutshell/ujip/adobe/ + +A sample (Adobe-Korea1-0) CIDFont is also available at the above URL. + There is also a somewhat brief description of CID-keyed fonts +at the end of Chapter 6 in UJIP. + + +7.10: NJWIN + + Hongbo Data Systems has recently released a ShareWare ($49 USD) +product called NJWIN whose purpose is to force the display of CJK text +in non-CJK applications running under US Windows 95. Actually, there +are two versions: full CJK and Japanese only.
+ NJWIN and its full description are available at the following +URL: + + http://www.njstar.com.au/njstar/njwin.htm + +Other (popular) URLs that carry NJWIN are as follows: + + ftp://ftp.ora.com/pub/examples/nutshell/ujip/windows/ + ftp://ftp.cc.monash.edu.au/pub/nihongo/ + + Hongbo Data Systems' e-mail address is: + + hongbo@njstar.com.au + +Their WWW Home Page is at the following URL: + + http://www.njstar.com.au/ + + +PART 8: CJK PROGRAMMING ISSUES + + This new section describes issues related to using specific +programming languages to process CJK text. + + +8.1: C AND C++ + + At one time I used C on a regular basis for my CJK programming +needs, and released three tools for others to use: JConv, JChar, and +JCode. While these tools are specific to Japanese, they can be easily +adapted for CJK use. Their source code is available at the following +URL: + + ftp://ftp.ora.com/pub/examples/nutshell/ujip/src/ + + I also provided several C code snippets in Chapter 7 of +UJIP. These are available in machine-readable form at the following +URL: + + ftp://ftp.ora.com/pub/examples/nutshell/ujip/Ch7/ + + +8.2: PERL + + Although Perl does not have any special CJK facilities (note +that most implementations of C and C++ do not either), it provides a +powerful programming environment that is useful for many CJK-related +tasks. + The noteworthy features of Perl are associative arrays and +regular expressions. These are features not found in C or C++, and +allow one to write meaningful code in little time. + JPerl is an implementation of Perl that provides two-byte +support for Japanese (EUC or Shift-JIS encoding). It is not ideal +because JPerl scripts often cannot run under (non-Japanese) Perl. + If you often write programs for internal use, I suggest that +you check out Perl to see if it can offer you something. Chances are +that it can. 
A good place to start looking at Perl is through books +on the subject (see Section A.3.1) and at the following URL: + + http://www.perl.com/ + + For those who like additional reading, "The Perl Journal" is +starting up, and information is at the following URL: + + http://work.media.mit.edu/the_perl_journal/ + + +8.3: JAVA + + I am just starting to learn about the Java programming +language (and rightly so since my wife is Javanese!). It seems to have +a lot to offer. + The most interesting aspects of Java are: + +o Built-in support for Unicode and UTF-8. +o The programmer must write code in the object-oriented paradigm. +o Provides a portable way to supply compiled code. +o Security features for Internet use. + +More information on Java is at the following URLs: + + http://www.gamelan.com/ + http://www.javasoft.com/ + +Oh, Gamelan is the name of Javanese music. + Of the books about Java published thus far, the one I consider +to be the best is "Java in a Nutshell" by David Flanagan. + One programming feature of Perl that I dearly miss in Java is +regexes (regular expressions). Luckily, some kind person wrote a regex +package for Java based on Perl regexes. Information on this Java regex +package is available at the following URL: + + http://www.win.net/~stevesoft/pat/ + + +A FINAL NOTE + + I hope that the information presented here will prove +useful. I would like to keep the electronic version of this document +as up-to-date as possible, and through readers' input, I am able to +do so. + Many readers will notice that I am very heavy into UNIX and +Macintosh (well, I recently got my first PC). If anyone has any +information on CJK-capable interfaces for other platforms, please feel +free to send it to me, and I will be sure to include it in the next +version of CJK.INF. Please include sources for the software or +documentation by providing addresses, phone numbers, FTP sites, and so +on.
+ Please do not hesitate to ask me further questions concerning +any subject presented in this document. + + +ACKNOWLEDGMENTS + + I would like to express my deepest thanks to Kazumasa Utashiro +of Internet Initiative Japan (IIJ). He taught me how to send and +receive Japanese text using the 7-bit ISO 2022 codes back in 1989. +With his help I was able to write JAPAN.INF, my book, and this +document in order to inform others about what he has taught me plus +more. + Next, I thank all the folks at O'Reilly & Associates for +publishing UJIP. Special thanks to Tim O'Reilly for accepting the book +proposal, and to Peter Mui for guiding me through the process. I have +had nothing but good experiences with "them there fine folks." + I got to know Jack Halpern through UJIP, and he subsequently +translated it into Japanese. Many thanks to him. + I am also grateful to my employer, Adobe Systems, for letting +me work on interesting CJK-related projects. I really like what I do +here. In particular, I want to thank Dan Mills, my manager, for +putting up with me for these past four years. + Lastly, I would also like to thank the countless people who +provided comments on JAPAN.INF, UJIP, and CJK.INF. I hope that this +new document lives up to the spirit of my previous efforts. + + +APPENDIX A: OTHER INFORMATION SOURCES + + One of the most useful types of information is pointers to +other information sources. This appendix provides just that. + + +A.1: USENET NEWSGROUPS AND MAILING LISTS + + Appendix L of UJIP provided information on a number of mailing +lists. This section supplements that appendix with information on +other useful mailing lists, and points out which ones in UJIP are +relevant to readers of CJK.INF.
+ + +A.1.1: USENET NEWSGROUPS + + The following Usenet Newsgroups typically have postings with +information relevant to issues discussed in CJK.INF (in alphabetical +order): + + alt.chinese.computing + alt.chinese.text (HZ encoding used for Chinese text) + alt.chinese.text.big5 (Big Five encoding used for Chinese text) + alt.japanese.text (JIS encoding used for Japanese text) + chinese.flame (UTF-7) + chinese.text.unicode (UTF-8) + comp.lang.c + comp.lang.c++ + comp.lang.java + comp.lang.perl.misc + comp.software.international + comp.std.internat + fj.editor.mule (JIS encoding used for Japanese text) + fj.kanji (JIS encoding used for Japanese text) + fj.net.infosystems.www.browsers (JIS encoding used for Japanese text) + fj.news.reader (JIS encoding used for Japanese text) + han.comp.hangul + han.sys.mac + sci.lang.japan (JIS encoding used for Japanese text) + + If your local news host does not provide a feed of the fj.* +newsgroups (shame on them!), or if you do not have access to Usenet +News, you can alternatively fetch them from the following URL: + + ftp://kuso.shef.ac.uk/pub/News/ + +The subdirectories correspond to the newsgroup name, but with the +"dots" being replaced by "slashes." For example, the "fj.binaries.mac" +newsgroup is archived in the "fj/binaries/mac" subdirectory. Many +thanks to Earl Kinmonth (jp1ek@sunc.shef.uc.uk) for this service. + There are some sites that carry full feeds of the fj.* +newsgroups, and permit public access (meaning that you can configure +your news reader to point to it). The only one I know of thus far is +as follows: + + ume.cc.tsukuba.ac.jp + + +A.1.2: MAILING LISTS + + The following are mailing lists that should interest readers +of this document (some are more active than others). The first line +after each entry indicates the address (or addresses) that can be used +for subscribing. The second line is the address for posting. 
+ +o CCNET-L MAILING LIST + listserv@uga.uga.edu (or listserv@uga) + ccnet-l@uga.uga.edu + +o China Net Mailing List + majordomo@lists.mindspring.com + (See http://www.asia-net.com/ or jobs@asia-net.com) + +o EASUG (East Asian Software Users Group) Mailing List + easug-request@guvax.acc.georgetown.edu + easug@guvax.acc.georgetown.edu + +o EBTI-L (Electronic Buddhist Text Initiative) Mailing List + ebti-l-request@uxmail.ust.hk + ebti-l@uxmail.ust.hk + +o EFJ (Electronic Frontiers Japan) Mailing List + majordomo@lists.twics.com + efj@lists.twics.com + +o Hangul Mailing List (han.comp.hangul newsgroup) + majordomo@cair.kaist.ac.kr + hangul@cair.kaist.ac.kr + +o INSOFT-L Mailing List + majordomo@trans2.b30.ingr.com + insoft-l@trans2.b30 + +o ISO 10646 Mailing List + listproc@listproc.hcf.jhu.edu + iso10646@listproc.hcf.jhu.edu + +o Japan Net Mailing List + majordomo@lists.mindspring.com + (See http://www.asia-net.com/ or jobs@asia-net.com) + +o KanjiTalk Mailing List + kanjitalk-request@cs15.atr-sw.atr.co.jp (or kanjitalk-request@crl.go.jp) + kanjitalk@cs15.atr-sw.atr.co.jp (or kanjitalk@crl.go.jp) + +o Mac Mailing List (han.sys.mac newsgroup) + majordomo@krnic.net + mac@krnic.net + +o Mule Mailing List + mule-request@etl.go.jp + mule@etl.go.jp or mule-jp@etl.go.jp + +o NIHONGO Mailing List (sci.lang.japan newsgroup) + listserv@mitvma.mit.edu (or listserv@mitvma) + nihongo@mitvma.mit.edu + +o Nihongo-Hiroba Mailing List + listproc@mcfeeley.cc.utexas.edu + nihongo-hiroba@mcfeeley.cc.utexas.edu + +o Nisus Mailing List + listserv@dartmouth.edu + nisus@dartmouth.edu + +o TLUG (Tokyo Linux User's Group) Mailing List + majordomo@lists.twics.com + tlug@lists.twics.com + +o Unicode Mailing List + unicode-request@unicode.org + unicode@unicode.org + +o WNN User Mailing List + wnn-user-request@wnn.astem.or.jp + wnn-user-jp@wnn.astem.or.jp + +o WWW Multilingual Mailing List + www-mling-request@square.ntt.jp + www-mling@square.ntt.jp + +If the name of the mailing list is part of 
the subscription address +(such as "easug-request"), the message body should look like this: + + subscribe + +Including your name is optional. If the username in the subscription +address is "listserv" or "majordomo" (these are names of mailing list +managing software), the mailing list name must appear after +"subscribe" in the message body as follows: + + subscribe ccnet-l + +Again, including your name is optional. + The following URL has information about Japanese-related +mailing lists: + + gopher://gan1.ncc.go.jp/11/INFO/mail-lists/ + + +A.2: INTERNET RESOURCES + + The Internet provides what I would consider to be the greatest +information resources of all. These can be subcategorized into FTP, +Telnet, Gopher, WWW, and e-mail. + + +A.2.1: USEFUL FTP SITES + + Below are the URLs for useful FTP sites. The directory +specified is the recommended place from which to start poking around +for useful files. + + ftp://cair-archive.kaist.ac.kr/pub/hangul/ + ftp://etlport.etl.go.jp/pub/mule/ + ftp://ftp.adobe.com/pub/adobe/ + ftp://ftp.cc.monash.edu.au/pub/nihongo/ + ftp://ftp.ifcss.org/pub/software/ + ftp://ftp.ora.com/pub/examples/nutshell/ujip/ + ftp://ftp.sra.co.jp/pub/ + ftp://ftp.uwtc.washington.edu/pub/Japanese/ + ftp://kuso.shef.ac.uk/pub/Japanese/ + ftp://unicode.org/pub/ + +This list is expected to grow. + + +A.2.2: USEFUL TELNET SITES + + For those who have a NIFTY-Serve account, there is now a very +convenient way to access NIFTY-Serve using telnet. The URL is as +follows: + + telnet://r2.niftyserve.or.jp/ + +Information about what NIFTY-Serve has to offer (and how to subscribe) +can be found at the following URL: + + http://www.nifty.co.jp/ + + Another information service with a similar access mechanism is +CompuServe, whose URL is as follows: + + telnet://compuserve.com/ + +You will need to press the return key to get the "Host Name:" prompt, +at which time you type "cis" (just follow the menus from this point +on).
+ You can also do a search on fj.* newsgroup articles at the +following URL: + + telnet://asahi-net.or.jp/ + +You log in as "fj-db" once you are connected. + + +A.2.3: USEFUL GOPHER SITES + + I am not too much of a Gopher user. There is, of course, the +following: + + gopher://gopher.ora.com/ + +Another Gopher site provides information on Japanese-related mailing +lists: + + gopher://gan1.ncc.go.jp/11/INFO/mail-lists/ + +If you happen to know of others, please let me know. + + +A.2.4: USEFUL WWW SITES + + Because the World-Wide Web is a constantly changing place (and +more importantly, because I don't want to re-issue a new version of +this document every month!), I will maintain links to useful documents +at my WWW Home Page. Its URL is as follows: + + http://jasper.ora.com/lunde/ + +If you cannot get to my WWW Home Page, you couldn't get to any that I +would list here anyway. + + +A.2.5: USEFUL MAIL SERVERS + + In the past (that is, in JAPAN.INF) I included a full list of +the domains in the "jp" hierarchy. That list took up a lot of space, and +it changes very rapidly. You can now send a request to a mail server to +have the most current listing returned. The mail server is: + + mail-server@nic.ad.jp + +The most common command is "send," and the following arguments can be +supplied to retrieve specific documents (and should be in the message +body, not on the "Subject:" line): + + send help + send index + send jpnic/domain-list.txt + send jpnic/domain-list-e.txt + +The first sends back a help file, the second sends back a complete +index of files that can be retrieved (use this one to see what other +useful stuff is available), and the last two send back a complete +listing of domains in the "jp" hierarchy (the last one sends it back in +English/romanized).
+ + +A.3.1: BOOKS + + There are other useful reference materials available in print +or online, in addition to the various national and international +standards mentioned throughout this document. The following are books +that I recommend for further reading or mental stimulus. (Sorry for +plugging my own books in this list, but they are relevant.) + +o Clews, John. "Language Automation Worldwide: The Development of + Character Set Standards." SESAME Computer Projects. 1988. ISBN + 1-870095-01-4. + +o Flanagan, David. "Java in a Nutshell." O'Reilly & Associates, + Inc. 1996. ISBN 1-56592-183-6. + +o Frisch, AEleen. "Essential System Administration." Second Edition. + O'Reilly & Associates, Inc. 1995. ISBN 1-56592-127-5. + +o Huang, Jack & Timothy Huang. "An Introduction to Chinese, Japanese + and Korean Computing." World Scientific Computing. 1989. ISBN + 9971-50-664-5. + +o IBM Corporation. "Character Data Representation Architecture - Level + 2, Registry." 1993. IBM order number SC09-1391-01. + +o Kano, Nadine. "Developing International Software for Windows 95 and + Windows NT." Microsoft Press. 1995. ISBN 1-55615-840-8. + +o Kirch, Olaf. "Linux Network Administrator's Guide." O'Reilly & + Associates, Inc. 1995. ISBN 1-56592-087-2. + +o Kissell, Joe. "The Nisus Way." MIS:Press. 1996. ISBN 1-55828-455-9. + +o Krol, Ed. "The Whole Internet User's Guide & Catalog." Second + Edition. O'Reilly & Associates, Inc. 1994. ISBN 1-56592-063-5. + +o Liu, Cricket et al. "Managing Internet Information Services." + O'Reilly & Associates, Inc. 1994. ISBN 1-56592-062-7. + +o Lunde, Ken. "Understanding Japanese Information Processing." + O'Reilly & Associates, Incorporated. 1993. ISBN 1-56592-043-0. LCCN + PL524.5.L86 1993. + +o Lunde, Ken. "Nihongo Joho Shori." SOFTBANK Corporation. 1995. ISBN + 4-89052-708-7. + +o Luong, Tuoc V. et al. "Internationalization: Developing Software for + Global Markets." John Wiley & Sons, Incorporated. 1995. ISBN + 0-471-07661-9. 
+ +o Schwartz, Randal L. "Learning Perl." O'Reilly & Associates, + Incorporated. 1993. ISBN 1-56592-042-2. + +o Stallman, Richard M. "GNU Emacs Manual." Tenth edition. Free + Software Foundation. 1994. ISBN 1-882114-04-3. + +o Tuthill, Bill. "Solaris International Developer's Guide." SunSoft + Press and PTR Prentice Hall. 1993. ISBN 0-13-031063-8. + +o Unicode Consortium, The. "The Unicode Standard: Worldwide Character + Encoding." Version 1.0. Volume 2. Addison-Wesley. 1992. ISBN + 0-201-60845-6. + +o Vromans, Johan. "Perl 5 Desktop Reference." O'Reilly & Associates, + Inc. 1996. ISBN 1-56592-187-9. + +o Wall, Larry & Randal L. Schwartz. "Programming Perl." O'Reilly & + Associates, Incorporated. 1991. ISBN 0-937175-64-1. + +o Welsh, Matt & Lar Kaufman. "Running Linux." O'Reilly & Associates, + Inc. 1995. ISBN 1-56592-100-3. + + If you want to get your hands on any of the national or +international standards mentioned in this document, I suggest the +following: + +o The American National Standards Institute can provide ISO, KS, and + JIS standards. Bear in mind that ISO standards will most likely + arrive as a photocopy of the original. + + ANSI + 11 West 42nd Street + New York, NY 10036 + USA + +1-212-642-4900 (phone) + +1-212-302-1286 (facsimile) + +o The International Organization for Standardization can provide + ISO standards. + + ISO + 1, rue de Varembé + Case postale 56 + CH-1211, Geneva 20 + SWITZERLAND + +41-22-749-01-11 (phone) + +41-22-733-34-30 (facsimile) + central@isocs.iso.ch (e-mail) + http://www.iso.ch/ (WWW) + +o Chinese (GB and CNS) standards are the hardest to obtain. It is + quite unfortunate. + + +A.3.2: MAGAZINES + +o "Computing Japan," published monthly, ISSN 1340-7228, + editors@cj.gol.com. + +o "MANGAJIN," published 10 times per year, ISSN 1051-8177. + +o "Multilingual Communications & Computing," published bi-monthly, + ISSN 1065-7657, info@multilingual.com.
+ +o "The Perl Journal," published quarterly, ISSN 1087-903X, + perl-journal-subscriptions@perl.com. + + +A.3.3: JOURNALS + +o "Chinese Information Processing" (CIP), published bi-monthly, ISSN + 1003-9082. (In Chinese.) + +o "Computer Processing of Chinese & Oriental Languages" (CPCOL), + co-published twice a year by World Scientific Publishing and Chinese + Language Computer Society (CLCS), ISSN 0715-9048. + +o "The Electronic Bodhidharma," published by the International + Research Institute for Zen (IRIZ) Buddhism, Hanazono University, + Japan. More information on the organization that publishes this + journal is available at the following URL: + + http://www.iijnet.or.jp/iriz/irizhtml/irizhome.htm + + +A.3.4: RFCs + + Many RFCs (Request For Comments) are relevant to this +document. They are: + +o RFC 1341: "MIME (Multipurpose Internet Mail Extensions): Mechanisms + for Specifying and Describing the Format of Internet Message + Bodies," by Nathaniel Borenstein and Ned Freed, June 1992. + +o RFC 1342: "Representation of Non-ASCII Text in Internet Message + Headers," by Keith Moore, June 1992. + +o RFC 1468: "Japanese Character Encoding for Internet Messages," by + Jun Murai et al., June 1993. + +o RFC 1521: "MIME (Multipurpose Internet Mail Extensions) Part One: + Mechanisms for Specifying and Describing the Format of Internet + Message Bodies," by Nathaniel Borenstein and Ned Freed, September + 1993. Obsoletes RFC 1341. + +o RFC 1522: "MIME (Multipurpose Internet Mail Extensions) Part Two: + Message Header Extensions for Non-ASCII Text," by Keith Moore, + September 1993. Obsoletes RFC 1342. + +o RFC 1554: "ISO-2022-JP-2: Multilingual Extension of ISO-2022-JP," by + Masataka Ohta and Kenichi Handa, December 1993. + +o RFC 1557: "Korean Character Encoding for Internet Messages," by + Uhhyung Choi et al., December 1993. + +o RFC 1642: "UTF-7: A Mail-Safe Transformation Format of Unicode," by + David Goldsmith and Mark Davis, July 1994. 
+ +o RFC 1815: "Character Sets ISO-10646 and ISO-10646-J-1," by Masataka + Ohta, July 1995. + +o RFC 1842: "ASCII Printable Characters-Based Chinese Character + Encoding for Internet Messages," by Ya-Gui Wei et al., August 1995. + +o RFC 1843: "HZ - A Data Format for Exchanging Files of Arbitrarily + Mixed Chinese and ASCII Characters," by Fung Fung Lee, August 1995. + +o RFC 1922: "Chinese Character Encoding for Internet Messages," by + Haifeng Zhu et al., March 1996. + +These RFCs can be obtained from FTP archives that contain all RFC +documents, such as those at the following URLs: + + ftp://nic.ddn.mil/rfc/ + ftp://ftp.uu.net/inet/rfc/ + +The specific RFCs listed above are also mirrored at the following URL for +convenience: + + ftp://ftp.ora.com/pub/examples/nutshell/ujip/Ch9/ + + +A.3.5: FAQs + + There are several FAQ (Frequently Asked Questions) files that +provide useful information. The following is a listing of some along +with their URLs: + +o "Japanese Language Information" FAQ (formerly the "sci.lang.japan" + FAQ) by Rafael Santos (santos@mickey.ai.kyutech.ac.jp) at: + + http://www.mickey.ai.kyutech.ac.jp/cgi-bin/japanese/ + + Update announcements are usually posted to the sci.lang.japan + newsgroup. + +o "Programming for Internationalization" FAQ by Michael Gschwind + (mike@vlsivie.tuwien.ac.at) at: + + ftp://ftp.vlsivie.tuwien.ac.at/pub/8bit/ISO-programming + + Also posted to the comp.software.international newsgroup.
This and + other internationalization documents are also accessible through the + following URL: + + http://www.vlsivie.tuwien.ac.at/mike/i18n.html + +o Three FAQs about Internet Service Providers in Japan by Taki Naruto + (tn@panix.com), Jesse Casman (jcasman@unm.edu), and Kenji Yoshida + (kenny@mb.tokyo.infoweb.or.jp), respectively, at: + + http://www.panix.com/~tn/ispj.html + http://nobunaga.unm.edu/internet.html + http://cswww2.essex.ac.uk/users/whean/japan/net.html + +o "Internationalization Reference List" by Eugene Dorr + (gdorr@pgh.legent.com) at: + + ftp://ftp.ora.com/pub/examples/nutshell/ujip/doc/i18n-books.txt + + Not really a FAQ, but quite useful because it is a very complete + listing of I18N-related books. + +o "INSOFT-L Service" by Brian Tatro (btatro@tatro.com) at: + + http://iquest.com/~btatro/in2.html + + This includes a link to the FAQ for the INSOFT-L Mailing List (see + Section A.1.2). + +o "How to Use Japanese on the Internet with a PC: From Login to WWW" + by Hideki Hirayama (sgw01623@niftyserve.or.jp) at: + + ftp://ftp.ora.com/pub/examples/nutshell/ujip/faq/jpn-inet.FAQ + +o "Hangul and Internet in Korea" FAQ by Jungshik Shin + (jshin@minerva.cis.yale.edu) at: + + http://pantheon.cis.yale.edu/~jshin/faq/ +--- END (CJK.INF VERSION 2.1 07/12/96) 185553 BYTES ---