--- BEGIN (CJK.INF VERSION 2.1 07/12/96) 185553 BYTES ---
CJK.INF Version 2.1 (July 12, 1996)
Copyright (C) 1995-1996 Ken Lunde. All Rights Reserved.
CJK is a registered trademark and service mark of The Research
Libraries Group, Inc.
Online Companion to "Understanding Japanese Information Processing"
- ENGLISH: 1993, O'Reilly & Associates, Inc., ISBN 1-56592-043-0
- JAPANESE: 1995, SOFTBANK Corporation, ISBN 4-89052-708-7
This online document provides information on CJK (that is,
Chinese, Japanese, and Korean) character set standards and encoding
systems. In short, it provides detailed information on how CJK text is
handled electronically. I am happy to share this information with
others, and I would appreciate any comments/feedback on its content.
The current version (master copy) of this document is maintained at:
ftp://ftp.ora.com/pub/examples/nutshell/ujip/doc/cjk.inf
This file may also be obtained by contacting me directly using one of
the e-mail addresses listed in the CONTACT INFORMATION section.
TABLE OF CONTENTS
VERSION HISTORY
RESTRICTIONS
CONTACT INFORMATION
WHAT HAPPENED TO JAPAN.INF?
DISCLAIMER
CONVENTIONS
INTRODUCTION
PART 1: WHAT'S UP WITH UJIP?
PART 2: CJK CHARACTER SET STANDARDS
2.1: JAPANESE
2.1.1: JIS X 0201-1976
2.1.2: JIS X 0208-1990
2.1.3: JIS X 0212-1990
2.1.4: JIS X 0221-1995
2.1.5: JIS X 0213-199X
2.1.6: OBSOLETE STANDARDS
2.2: CHINESE (PRC)
2.2.1: GB 1988-89
2.2.2: GB 2312-80
2.2.3: GB 6345.1-86
2.2.4: GB 7589-87
2.2.5: GB 7590-87
2.2.6: GB 8565.2-88
2.2.7: GB/T 12345-90
2.2.8: GB/T 13131-9X
2.2.9: GB/T 13132-9X
2.2.10: GB 13000.1-93
2.2.11: ISO-IR-165:1992
2.2.12: OBSOLETE STANDARDS
2.3: CHINESE (TAIWAN)
2.3.1: BIG FIVE
2.3.2: CNS 11643-1992
2.3.3: CNS 5205
2.3.4: OBSOLETE STANDARDS
2.4: KOREAN
2.4.1: KS C 5636-1993
2.4.2: KS C 5601-1992
2.4.3: KS C 5657-1991
2.4.4: GB 12052-89
2.4.5: KS C 5700-1995
2.4.6: OBSOLETE STANDARDS
2.5: CJK
2.5.1: ISO 10646-1:1993
2.5.2: CCCII
2.5.3: ANSI Z39.64-1989
2.6: OTHER
2.6.1: GB 8045-87
2.6.2: TCVN-5773:1993
PART 3: CJK ENCODING SYSTEMS
3.1: 7-BIT ISO 2022 ENCODING
3.1.1: CODE SPACE
3.1.2: ISO-REGISTERED ESCAPE SEQUENCES
3.1.3: ISO-2022-JP AND ISO-2022-JP-2
3.1.4: ISO-2022-KR
3.1.5: ISO-2022-CN AND ISO-2022-CN-EXT
3.2: EUC ENCODING
3.2.1: JAPANESE REPRESENTATION
3.2.2: CHINESE (PRC) REPRESENTATION
3.2.3: CHINESE (TAIWAN) REPRESENTATION
3.2.4: KOREAN REPRESENTATION
3.3: LOCALE-SPECIFIC ENCODINGS
3.3.1: SHIFT-JIS
3.3.2: HZ (HZ-GB-2312)
3.3.3: zW
3.3.4: BIG FIVE
3.3.5: JOHAB
3.3.6: N-BYTE HANGUL
3.3.7: UCS-2
3.3.8: UCS-4
3.3.9: UTF-7
3.3.10: UTF-8
3.3.11: UTF-16
3.3.12: ANSI Z39.64-1989
3.3.13: BASE64
3.3.14: IBM DBCS-HOST
3.3.15: IBM DBCS-PC
3.3.16: IBM DBCS-/TBCS-EUC
3.3.17: UNIFIED HANGUL CODE
3.3.18: TRON CODE
3.3.19: GBK
3.4: CJK CODE PAGES
PART 4: CJK CHARACTER SET COMPATIBILITY ISSUES
4.1: JAPANESE
4.2: CHINESE (PRC)
4.3: CHINESE (TAIWAN)
4.4: KOREAN
4.5: ISO 10646-1:1993
4.6: UNICODE
4.7: CODE CONVERSION TIPS
PART 5: CJK-CAPABLE OPERATING SYSTEMS
5.1: MS-DOS
5.2: WINDOWS
5.3: MACINTOSH
5.4: UNIX AND X WINDOWS
5.5: OTHERS
PART 6: CJK TEXT AND INTERNET SERVICES
6.1: ELECTRONIC MAIL
6.2: USENET NEWS
6.3: GOPHER
6.4: WORLD-WIDE WEB
6.5: FILE TRANSFER TIPS
PART 7: CJK TEXT HANDLING SOFTWARE
7.1: MULE
7.2: CNPRINT
7.3: MASS
7.4: ADOBE TYPE MANAGER (ATM)
7.5: MACINTOSH SOFTWARE
7.6: MACBLUE TELNET
7.7: CXTERM
7.8: UW-DBM
7.9: POSTSCRIPT
7.10: NJWIN
PART 8: CJK PROGRAMMING ISSUES
8.1: C AND C++
8.2: PERL
8.3: JAVA
A FINAL NOTE
ACKNOWLEDGMENTS
APPENDIX A: INFORMATION SOURCES
A.1: USENET NEWSGROUPS AND MAILING LISTS
A.1.1: USENET NEWSGROUPS
A.1.2: MAILING LISTS
A.2: INTERNET RESOURCES
A.2.1: USEFUL FTP SITES
A.2.2: USEFUL TELNET SITES
A.2.3: USEFUL GOPHER SITES
A.2.4: USEFUL WWW SITES
A.2.5: USEFUL MAIL SERVERS
A.3: OTHER RESOURCES
A.3.1: BOOKS
A.3.2: MAGAZINES
A.3.3: JOURNALS
A.3.4: RFCs
A.3.5: FAQs
VERSION HISTORY
The following is a complete listing of the earlier versions of
this document along with their release dates and sizes (in bytes):
Document    Version  Release Date  Size
^^^^^^^^    ^^^^^^^  ^^^^^^^^^^^^  ^^^^
JAPAN.INF   1.0      Unknown       Unknown
JAPAN.INF   1.1      08/19/91      101,784
JAPAN.INF   1.2      03/20/92      166,929 (JIS) or 165,639 (Shift-JIS/EUC)
CJK.INF     1.0      06/09/95      103,985
CJK.INF     1.1      06/12/95      112,771
CJK.INF     1.2      06/14/95      125,275
CJK.INF     1.3      06/16/95      130,069
CJK.INF     1.4      06/19/95      142,543
CJK.INF     1.5      06/22/95      146,064
CJK.INF     1.6      06/29/95      150,882
CJK.INF     1.7      08/15/95      153,772
CJK.INF     1.8      09/11/95      157,295
CJK.INF     1.9      12/18/95      170,698
CJK.INF     2.0      03/12/96      175,973
With the release of this version, all of the above are now considered
obsolete. Also, note the three-year gap between the last installment
of JAPAN.INF and the first installment of CJK.INF -- I was writing
UJIP and my PhD dissertation during those three years. Ah, so much for
excuses...
RESTRICTIONS
This document is provided free-of-charge to *anyone*, but no
person or company is permitted to modify, sell, or otherwise
distribute it for profit or other purposes. This document may be
bundled with commercial products only with the prior consent from the
author, and provided that it is not modified in any way whatsoever.
The point here is that I worked long and hard on this document so that
lots of fine folks and companies can benefit from its contents -- not
profit from it.
CONTACT INFORMATION
I would enjoy hearing from readers of this document, even if
it is just to say "hello" or whatever. I can be contacted as follows:
Ken Lunde
Adobe Systems Incorporated
1585 Charleston Road
P.O. Box 7900
Mountain View, CA 94039-7900 USA
415-962-3866 (office phone)
415-960-0886 (facsimile)
lunde@adobe.com (preferred)
lunde@ora.com or ujip@ora.com
WWW Home Page: http://jasper.ora.com/lunde/
If you wonder what I do for my day job, read on.
I have been working for Adobe Systems for over four years now
(before that I was a graduate student at UW-Madison), and my current
position is Project Manager, CJK Type Development.
WHAT HAPPENED TO JAPAN.INF?
Put bluntly, JAPAN.INF died. It first evolved into my first
book entitled "Understanding Japanese Information Processing" (this
book is now into its second printing, and the Japanese translation was
just published). After my book came out, I did attempt to update
JAPAN.INF, but the effort felt a bit futile. I decided that something
fresh was necessary.
JAPAN.INF also evolved into this document, which breaks the
Japanese barrier by providing similar information on Chinese and
Korean character sets and encodings. It fills the Chinese and Korean
gap, so to speak. My specialty (and hobby, believe it or not) is the
field of CJK character sets and encoding systems, so I felt that
shifting this document more towards those lines was an appropriate use of
my (copious) free time (I wish there were more than 24 hours in a
day!). Besides, this document now becomes useful to a much broader
audience.
DISCLAIMER
Ah yes, the ever popular disclaimer! Here's mine. Although I
list my address here at Adobe Systems Incorporated for contact
purposes, Adobe Systems does not endorse this document which I have
created, and have continued (and will continue) to update on a regular
basis (uh, yeah, I promise this time!). This document is a personal
endeavor to inform people of how CJK text can be handled on a variety
of platforms.
CONVENTIONS
The notation that is used for detailing Internet resource
information, such as the Internet protocol type, site name, path, and
file, follows the URL (Uniform Resource Locator) notation, namely:
protocol://site-name/path/file
An example URL is as follows:
ftp://ftp.ora.com/pub/examples/nutshell/ujip/00README
The protocol is FTP, the site-name is ftp.ora.com, the path is
pub/examples/nutshell/ujip/, and the file is 00README. Also note that
this same notation is used for invoking FTP on WWW (World Wide Web)
browsing software, such as Mosaic, Netscape, or Lynx.
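For readers who like to see such things spelled out, here is a minimal
sketch of how the four parts of the notation above might be pulled out of
a URL. It uses Python's standard urllib.parse module; the function name
split_url and the dictionary keys are my own illustrative choices, not
part of any standard.

```python
from urllib.parse import urlparse
import posixpath


def split_url(url):
    """Split a URL into protocol, site-name, path, and file (hypothetical helper)."""
    parts = urlparse(url)
    # Separate the directory portion from the final file name.
    directory, filename = posixpath.split(parts.path)
    return {
        "protocol": parts.scheme,
        "site-name": parts.netloc,
        # Drop the leading slash and restore the trailing one, to match
        # the way paths are written in this document.
        "path": directory.lstrip("/") + "/",
        "file": filename,
    }


example = split_url("ftp://ftp.ora.com/pub/examples/nutshell/ujip/00README")
print(example)
```

Applied to the example URL, this should yield "ftp", "ftp.ora.com",
"pub/examples/nutshell/ujip/", and "00README", respectively.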
Note that most references to HTTP documents use the four-
letter file extension ".html". However, some HTTP documents are on
file systems that support only three-letter file extensions (can you
say "MS-DOS"?), so you may encounter just ".htm". This is just to let
you know that what you see is not a typo.
References to my book "Understanding Japanese Information
Processing" are (affectionately) abbreviated as UJIP. These references
also apply to the Japanese translation (UJIP-J).
Hexadecimal values are prefixed with 0x, and every two
hexadecimal digits represent a one-byte value. Other values can be
assumed to be in decimal notation.
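The hexadecimal convention can be illustrated with a short sketch. The
function name to_bytes is hypothetical, and the value 0x3021 is merely an
arbitrary two-byte example; the point is only that each pair of
hexadecimal digits stands for one byte.

```python
def to_bytes(hex_value):
    """Convert a hexadecimal string (with or without 0x) into one-byte values."""
    digits = hex_value[2:] if hex_value.lower().startswith("0x") else hex_value
    if len(digits) % 2:
        raise ValueError("expected an even number of hexadecimal digits")
    # bytes.fromhex consumes two hexadecimal digits per byte.
    return list(bytes.fromhex(digits))


two_bytes = to_bytes("0x3021")
print([hex(b) for b in two_bytes])  # the two bytes 0x30 and 0x21
```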
Chinese characters are referred to as kanji (Japanese), hanzi
(Chinese), or hanja (Korean), depending on context.
References to ISO 10646-1:1993 also refer to Unicode
(usually). I have done this so that I do not have to repeat "Unicode"
in the same context as ISO 10646-1:1993. There are times, however,
when I need to distinguish ISO 10646-1:1993 from Unicode.
INTRODUCTION
Electronic mail (e-mail), just one of the many Internet
resources, has become a very efficient means of communicating both
locally and world-wide. While it is very simple to send text which
uses only the 94 printable ASCII characters, character sets that
contain more than these ASCII characters pose special problems.
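As a rough illustration of the distinction drawn above, the following
sketch enumerates the 94 printable ASCII characters (0x21 through 0x7E)
and tests whether a piece of text goes beyond 7-bit ASCII. The function
name needs_special_handling is my own invention, and the test is a
simplification: it says nothing about which character set or encoding
the text actually requires.

```python
# The 94 printable ASCII characters occupy 0x21 through 0x7E;
# space (0x20) and delete (0x7F) are not counted among them.
PRINTABLE_ASCII = [chr(code) for code in range(0x21, 0x7F)]


def needs_special_handling(text):
    """Return True if the text contains any character beyond 7-bit ASCII."""
    return any(ord(ch) > 0x7F for ch in text)


print(len(PRINTABLE_ASCII))
print(needs_special_handling("hello"))
print(needs_special_handling("\u65e5\u672c\u8a9e"))  # three kanji
```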
This document is primarily concerned with CJK character set
and encoding issues. Much of this sort of information is not easily
obtained. This represents one person's attempt at making such
information more widely available.
PART 1: WHAT'S UP WITH UJIP?
UJIP (First Edition) was published in September 1993 by
O'Reilly & Associates, Incorporated. The second printing (*not* the
Second Edition) was subsequently published in March 1994. The page
count for both printings is unchanged at 470.
The following files contain the latest information about
changes (additions and corrections) made to UJIP and UJIP-J for
various printings, both for those that have taken place (such as for
the second printing of the English edition) and for those that are
planned (the first digit is the edition, and the second is the
printing):
ftp://ftp.ora.com/pub/examples/nutshell/ujip/errata/ujip-errata-1-2.txt
ftp://ftp.ora.com/pub/examples/nutshell/ujip/errata/ujip-errata-1-3.txt
ftp://ftp.ora.com/pub/examples/nutshell/ujip/errata/ujip-j-errata-1-2.txt
I *highly* recommend that all readers of UJIP obtain these errata
files. Those without FTP access can request copies directly from me.
The Japanese translation of UJIP (UJIP-J), co-published by
O'Reilly & Associates, Incorporated and SOFTBANK Corporation, was just
released. The translation was done by my good friend Jack Halpern,
along with one of his colleagues, Takeo Suzuki. The Japanese edition
incorporates corrections and updates not yet found in the English
edition. The page count is 535.
Late-breaking news! I am currently working on UJIP Second
Edition (to be retitled as "Understanding CJK Information Processing"
and abbreviated UCJKIP). If all goes well, it should be available by
January 1997, and will be well over 700 pages. If there was something
you wanted to see in UJIP, now's your chance to send me a request...
PART 2: CJK CHARACTER SET STANDARDS
These sections describe the character sets used in Japan,
China (PRC and Taiwan), and Korea. Exact numbers of characters are
provided for each character set standard (when known), as well as
tidbits of information not otherwise available. This provides the
basic foundations for understanding how CJK scripts are handled on
computer systems.
The two basic types of characters enumerated by CJK character
set standards are Chinese characters (kanji, hanzi, or hanja), which
number in the thousands (and, in some cases, tens of thousands), and
characters other than Chinese characters (symbols, numerals, kana,
hangul, alphabets, and so on), which usually number in the hundreds
(there are thousands of pre-combined hangul, though).
If you happen to be running X Windows, it is very easy to
display these CJK character sets (if a bitmapped font for the
character set exists, that is). Here is what I usually do:
o Obtain a BDF (Bitmap Distribution Format) font for the target
character set. Try the following URLs for starters:
ftp://cair-archive.kaist.ac.kr/pub/hangul/fonts/
ftp://etlport.etl.go.jp/pub/mule/fonts/
ftp://ftp.ifcss.org/pub/software/fonts/{big5,cns,gb,misc,unicode}/bdf/
ftp://ftp.kuis.kyoto-u.ac.jp/misc/fonts/jisksp-fonts/
ftp://ftp.net.tsinghua.edu.cn/pub/Chinese/fonts/
ftp://ftp.ora.com/pub/examples/nutshell/ujip/unix/
ftp://ftp.technet.sg:/pub/chinese/fonts/
http://ccic.ifcss.org/www/pub/software/fonts/
BDF files usually have the string "bdf" somewhere in their file
name, typically at the end. If the file is compressed (a name ending
in .gz or .Z is a good indication), decompress it. BDF files are
text files.
o Convert the BDF file to SNF (Server Natural Format) or PCF (Portable
Compiled Format) using the programs "bdftosnf" or "bdftopcf,"
respectively. Example command lines are as follows:
% bdftopcf jiskan16-1990.bdf > k16-90.pcf
% bdftosnf jiskan16-1990.bdf > k16-90.snf
SNF files (and the "bdftosnf" program) are used on X11R4 and
earlier, and PCF files (and the "bdftopcf" program) are used on
X11R5 and later.
o Copy the SNF or PCF file to a directory in the font search path (or
make a new path). Supposing I made a new directory called "fonts" in
my home directory, I then run "mkfontdir" on the directory
containing the SNF or PCF files as follows:
% mkfontdir ~/fonts
This creates a fonts.dir file in ~/fonts. I can now add this
directory to my font search path with the following command:
% xset +fp ~/fonts
o The command "xfd" (X Font Displayer) with the "-fn" switch followed
by a font name then invokes a window that displays all the
characters of the font. In the case of two-byte (CJK) fonts, one row
is displayed at a time. The following is an example command line:
% xfd -fn -misc-fixed-medium-r-normal--16-150-75-75-c-160-jisx0208.1990-0
You can create a "fonts.alias" file in the same directory as the
"fonts.dir" file in order to shorten the name when accessing the
font. The alias "k16-90" could be used instead if the content of the
fonts.alias file is as follows:
k16-90 -misc-fixed-medium-r-normal--16-150-75-75-c-160-jisx0208.1990-0
Don't forget to execute the following command in order to make the
X server aware of the new alias:
% xset fp rehash
Now you can use a simpler command line for "xfd" as follows:
% xfd -fn k16-90
The "X Window System User's Guide" (Volume 3 of the X Window
System series by O'Reilly & Associates, Inc.) provides detailed
information on managing fonts under X Windows (pp 123-160). The
article entitled "The X Administrator: Font Formats and Utilities" (pp
14-34 in "The X Resource," Issue 2), describes the BDF, SNF, and PCF
formats in great detail.
There is another bitmap format called HBF (Hanzi Bitmap
Format), which is similar to BDF, but optimized for fixed-width
(monospaced) fonts. It is described in the article entitled "The HBF
Font Format: Optimizing Fixed-pitch Font Support" (pp 113-123 in "The
X Resource," Issue 10), and also at the following URL:
ftp://ftp.ifcss.org/pub/software/fonts/hbf-discussion/
HBF fonts can be found at the following URL:
ftp://ftp.ifcss.org/pub/software/fonts/{big5,cns,gb,misc,unicode}/hbf/
Lastly, you may wish to check out my newly-developed CJK
Character Set Server, which generates various CJK character sets with
proper encoding applied. It is written in Perl, and accessed through
an HTML form. This server can be considered an upgrade to my JChar
tool (written in C). The URL is:
http://jasper.ora.com/lunde/cjk-char.html
2.1: JAPANESE
All (national) character set standards that originate in Japan
have names that begin with the three letters JIS. JIS is short for
"Japanese Industrial Standard." It is JSA (Japanese Standards
Association), however, that publishes the corresponding manuals. Chapter 3 and
Appendixes H and J of UJIP provide more detailed information on
Japanese character set standards.
2.1.1: JIS X 0201-1976
JIS X 0201-1976 (formerly JIS C 6220-1969; reaffirmed in 1989;
a revision with no character set changes is currently under public
review) enumerates two sets of characters: JIS-Roman and half-width
katakana.
JIS-Roman is the Japanese equivalent of the ASCII character
set, namely 128 characters consisting of the following:
o 10 numerals
o 52 uppercase and lowercase characters of the Latin alphabet
o 32 symbols (punctuation and so on)
o 34 non-printing characters (white space and control characters)
The term "white space" refers to characters that occupy space, but
have no appearance, such as tabs, spaces, and termination characters
(line feed, carriage return, and form feed).
So, how are JIS-Roman and ASCII different? The following
three codes are (usually) different:
Code    ASCII         JIS-Roman
^^^^    ^^^^^         ^^^^^^^^^
0x5C    backslash     yen symbol
0x7C    broken bar    bar
0x7E    tilde         overbar
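The differences in the table above can be expressed as a small
decoding function. The following is a minimal Python sketch, not part
of any standard: the function name is hypothetical, and the choice of
Unicode code points (U+00A5 YEN SIGN and U+203E OVERLINE) is an
assumption made for illustration. Code 0x7C is passed through
unchanged, because the difference there is only one of glyph shape.

```python
# Sketch: decode 7-bit JIS-Roman bytes to Unicode text.
# The two overrides below are illustrative assumptions; every other
# code is treated exactly like ASCII.
JIS_ROMAN_OVERRIDES = {
    0x5C: "\u00A5",  # yen symbol where ASCII has a backslash
    0x7E: "\u203E",  # overbar where ASCII has a tilde
}


def decode_jis_roman(data: bytes) -> str:
    """Return the text for a sequence of 7-bit JIS-Roman codes."""
    chars = []
    for byte in data:
        if byte > 0x7F:
            raise ValueError(f"not a 7-bit JIS-Roman code: 0x{byte:02X}")
        chars.append(JIS_ROMAN_OVERRIDES.get(byte, chr(byte)))
    return "".join(chars)


# The bytes 0x5C "100" 0x20 0x7E decode to a yen symbol, "100",
# a space, and an overbar.
print(decode_jis_roman(b"\\100 ~"))
```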
Half-width katakana consists of 63 characters that provide a
minimal set of characters necessary for expressing Japanese. The
shapes are compressed, and visually occupy a space half that of
*normal* Japanese characters.
2.1.2: JIS X 0208-1990
This basic Japanese character set standard enumerates 6,879
characters, 6,355 of which are kanji separated into two levels. Kanji
in the first level are arranged by (most frequent) reading, and those
in the second level are arranged by radical then total number of
(remaining) strokes.
o Row 1: 94 symbols
o Row 2: 53 symbols
o Row 3: 10 numerals and 52 uppercase and lowercase Latin alphabet
o Row 4: 83 hiragana
o Row 5: 86 katakana
o Row 6: 48 uppercase and lowercase Greek alphabet
o Row 7: 66 uppercase and lowercase Cyrillic (Russian) alphabet
o Row 8: 32 line-drawing elements
o Rows 16 through 47: 2,965 kanji (JIS Level 1 Kanji; last is 47-51)
o Rows 48 through 84: 3,390 kanji (JIS Level 2 Kanji; last is 84-06)
Appendix B of UJIP provides a complete illustration of the JIS X
0208-1990 character set standard by KUTEN (row-cell) code. Appendix G
(pp 294-317) of "Developing International Software for Windows 95 and
Windows NT" by Nadine Kano illustrates the JIS X 0208-1990 character
set standard plus the Microsoft extensions by Shift-JIS code
(Microsoft calls this Code Page 932).
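The character counts given above can be cross-checked from the
row-cell positions alone. The following Python sketch assumes that
each kanji level fills every row completely except possibly its last
one, which is what the "last is" positions in the list imply; the
helper name count_cells is hypothetical.

```python
# Sketch: recompute the JIS X 0208-1990 totals from the row listing.
CELLS_PER_ROW = 94


def count_cells(first_row, last_row, last_cell):
    """Characters in a block whose rows are full except perhaps the last."""
    return (last_row - first_row) * CELLS_PER_ROW + last_cell


# Rows 1 through 8, in the order listed above.
non_kanji = [94, 53, 10 + 52, 83, 86, 48, 66, 32]

level_1 = count_cells(16, 47, 51)   # last Level 1 kanji is 47-51
level_2 = count_cells(48, 84, 6)    # last Level 2 kanji is 84-06
total = sum(non_kanji) + level_1 + level_2

print(level_1, level_2, level_1 + level_2, total)
```

The four values printed correspond to the Level 1 count, the Level 2
count, the total number of kanji, and the total number of characters.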
Earlier versions of this standard were dated 1978 (JIS C
6226-1978) and 1983 (JIS X 0208-1983, formerly JIS C 6226-1983).
JIS X 0208 went through a revision (from November 1995 until
February 1996), and is slated for publication sometime in 1996 (to
become JIS X 0208-1996). More information on this revision is
available at the following URL:
ftp://ftp.tiu.ac.jp/jis/jisx0208/
2.1.3: JIS X 0212-1990
This supplemental Japanese character set standard enumerates
6,067 characters, 5,801 of which are kanji ordered by radical then
total number of (remaining) strokes. All 5,801 kanji are unique when
compared to those in JIS X 0208-1990 (see Section 2.1.2). The
remaining 266 characters are categorized as non-kanji.
o Row 2: 21 diacritics and symbols
o Row 6: 21 Greek characters with diacritics
o Row 7: 26 Eastern European characters
o Rows 9 through 11: 198 alphabetic characters
o Rows 16 through 77: 5,801 kanji (last is 77-67)
Appendix C of UJIP provides a complete illustration of the JIS X
0212-1990 character set standard by KUTEN (row-cell) code.
The only commercial operating system that provides JIS X
0212-1990 support is BTRON by Personal Media Corporation:
http://www.personal-media.co.jp/
Section 3.3.18 provides information about TRON Code (used by BTRON),
and details how it encodes the JIS X 0212-1990 character set.
2.1.4: JIS X 0221-1995
This document is, for all practical purposes, the Japanese
translation of ISO 10646-1:1993 (see Section 2.5.1). Like ISO
10646-1:1993, it is based on Unicode Version 1.1.
It is noteworthy that JIS X 0221-1995 enumerates subsets that
are applicable for Japanese use (a brief description of their contents
in parentheses):
o BASIC JAPANESE (JIS X 0208-1990 and JIS X 0201-1976 -- characters
that can be created by means of combining are not included -- 6,884
characters)
o JAPANESE NON IDEOGRAPHICS SUPPLEMENT (1,913 characters: all non-
kanji of JIS X 0212-1990 plus hundreds of non-JIS characters)
o JAPANESE IDEOGRAPHICS SUPPLEMENT 1 (918 frequently-used kanji from
JIS X 0212-1990, including 28 that are identical to kanji forms in
JIS C 6226-1978)
o JAPANESE IDEOGRAPHICS SUPPLEMENT 2 (the remainder of JIS X 0212-
1990, namely 4,883 kanji)
o JAPANESE IDEOGRAPHICS SUPPLEMENT 3 (the remaining kanji of ISO
10646-1:1993, namely 8,746 characters)
o FULLWIDTH ALPHANUMERICS (94 characters; for compatibility)
o HALFWIDTH KATAKANA (63 characters; for compatibility)
Pages 893 through 993 provide Kangxi Zidian (a classic
300-year-old Chinese character dictionary containing approximately
50,000 characters) and Dai Kanwa Jiten (also known as Morohashi)
indexes for the entire Chinese character block, namely from 0x4E00
through 0x9FA5.
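The size of this block can be compared against the kanji counts of
the subsets listed above. The following Python sketch assumes that
BASIC JAPANESE contributes exactly the 6,355 kanji of JIS X 0208-1990
(see Section 2.1.2) and that no kanji is counted in more than one
subset; both are assumptions made for the sake of the check.

```python
# Sketch: compare the Chinese character block size with the subsets.
first, last = 0x4E00, 0x9FA5
block_size = last - first + 1

kanji_by_subset = {
    "BASIC JAPANESE (kanji only, assumed)": 6355,
    "JAPANESE IDEOGRAPHICS SUPPLEMENT 1": 918,
    "JAPANESE IDEOGRAPHICS SUPPLEMENT 2": 4883,
    "JAPANESE IDEOGRAPHICS SUPPLEMENT 3": 8746,
}
subset_total = sum(kanji_by_subset.values())

print(block_size, subset_total, block_size == subset_total)
```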
At 25,750 Yen, it is actually cheaper than ISO 10646-1:1993!
2.1.5: JIS X 0213-199X
I recently became aware that JSA plans to publish an extension
to JIS X 0208, containing approximately 2,000 characters (kanji and
non-kanji). A public review of this new standard is planned for Summer
1996. I would expect that its information will eventually be available
at the following URL:
ftp://ftp.tiu.ac.jp/jis/
2.1.6: OBSOLETE STANDARDS
JIS C 6226-1978 and JIS X 0208-1983 (formerly JIS C 6226-1983)
have been superseded by JIS X 0208-1990. Section 4.1 provides details
on the changes made between these earlier versions of JIS X 0208.
JIS X 0221-1995 does not mean the end of JIS X 0201-1976, JIS
X 0208-1990, and JIS X 0212-1990. Instead, it will co-exist with those
standards.
2.2: CHINESE (PRC)
All character set standards that originate in PRC have
designations that begin with "GB." "GB" is short for "Guo Biao" (which
is, in turn, short for "Guojia Biaojun") and means "National
Standard." A select few also have "/T" attached. The "T" is short for
"Tuijian" and means "Recommended." Section 2.2.11 describes ISO-IR-165:1992,
which is a variant of GB 2312-80. It is included here because of this
relationship.
Most people correlate GB character set standards with
simplified Chinese, but as you will see below, that is not always the
case.
There are three basic character sets, each one having a
simplified and traditional version.
Character Set    Set Number    Character Forms
^^^^^^^^^^^^^    ^^^^^^^^^^    ^^^^^^^^^^^^^^^
GB 2312-80       0             Simplified
GB/T 12345-90    1             Traditional of GB 2312-80
GB 7589-87       2             Simplified
GB/T 13131-9X    3             Traditional of GB 7589-87
GB 7590-87       4             Simplified
GB/T 13132-9X    5             Traditional of GB 7590-87
2.2.1: GB 1988-89
This character set, formerly GB 1988-80 and sometimes referred
to as GB-Roman, is the Chinese analog to ASCII and ISO 646. The main
difference is that the currency symbol (0x24), which is represented as
a dollar sign ($) in ASCII, is represented as a Chinese Yuan
(currency) symbol instead.
2.2.2: GB 2312-80
This basic (simplified) Chinese character set standard
enumerates 7,445 characters, 6,763 of which are hanzi separated into
two levels. Hanzi in the first level are arranged by reading, and
those in the second level are arranged by radical then total number of
(remaining) strokes. GB 2312-80 is also known as the "Primary Set,"
GB0 (zero), or just GB.
o Row 1: 94 symbols
o Row 2: 72 numerals
o Row 3: 94 full-width GB 1988-89 characters (see Section 2.2.1)
o Row 4: 83 hiragana
o Row 5: 86 katakana
o Row 6: 48 uppercase and lowercase Greek alphabet
o Row 7: 66 uppercase and lowercase Cyrillic (Russian) alphabet
o Row 8: 26 Pinyin and 37 Bopomofo characters
o Row 9: 76 line-drawing elements (09-04 through 09-79)
o Rows 16 through 55: 3,755 hanzi (Level 1 Hanzi; last is 55-89)
o Rows 56 through 87: 3,008 hanzi (Level 2 Hanzi; last is 87-94)
Compare some of the structure with JIS X 0208-1990, and you will find
many similarities, such as:
o Hiragana, katakana, Greek, and Cyrillic characters are in Rows 4, 5,
6, and 7, respectively
o Chinese characters begin at Row 16
o Chinese characters are separated into two levels
o Level 1 arranged by reading
o Level 2 arranged by radical then total number of strokes
The Japanese standard, JIS C 6226-1978, came out in 1978, which means
that it pre-dates GB 2312-80. The above similarities could not be by
coincidence, but rather by design.
Appendix G (pp 318-344) of "Developing International Software
for Windows 95 and Windows NT" by Nadine Kano illustrates the GB 2312-
80 character set standard by EUC code (Microsoft calls this Code Page
936). Code Page 936 incorporates the correction of the hanzi at 79-81,
and the correction of the order of 07-22 and 07-23 (see Section 2.2.3
for more details).
2.2.3: GB 6345.1-86
This document specifies corrections and additions to GB
2312-80 (see Section 2.2.2). The following is a detailed enumeration
of the changes:
o The form of "g" in Row 3 (position 71) was altered
o Row 8 has six additional Pinyin characters (08-27 through 08-32)
o Row 10 contains half-width versions of Row 3 (94 characters)
o Row 11 contains half-width versions of the Pinyin characters from
Row 8 (32 characters; 11-01 through 11-32)
o The hanzi at 79-81 was corrected to have a simplified left-side
radical (this was an error in GB 2312-80)
Note that these changes affect the total number of characters in GB
2312-80 -- an increase of 132 characters. This now makes 7,577 as
the total number of characters in GB 2312-80 (7,445 plus 132).
There was, however, an undocumented correction made in GB
6345.1-86. The order of characters 07-22 and 07-23 (uppercase
Cyrillic) was reversed. This error is apparently in the first and
perhaps second printing of the GB 2312-80 manual, because the copy I
have is from the third printing, and this has been corrected. Page 145
(Figure 113) of John Clews' "Language Automation Worldwide: The
Development of Character Set Standards" illustrates this error.
Developers should take special note of this -- I have seen GB 2312-80
based font products that propagate this ordering error.
2.2.4: GB 7589-87
This character set enumerates 7,237 hanzi in Rows 16 through
92 (last is 92-93), and they are ordered by radical then total number
of (remaining) strokes. GB 7589-87 is also known as the "Second
Supplementary Set" or GB2.
2.2.5: GB 7590-87
This character set enumerates 7,039 hanzi in Rows 16 through
90 (last is 90-83), and they are ordered by radical then total number
of (remaining) strokes. GB 7590-87 is also known as the "Fourth
Supplementary Set" or GB4.
2.2.6: GB 8565.2-88
This standard makes additions to GB 2312-80 (these additions
are separate from those made in GB 6345.1-86 described in Section
2.2.3). GB 8565.2-88 is also known as GB8. In this case there are 705
additions, indicated as follows:
o Row 13 contains 50 hanzi from GB 7589-87 (last is 13-50)
o Row 14 contains 92 hanzi from GB 7590-87 (last is 14-92)
o Row 15 contains 69 non-hanzi indicating dates and times, plus 24
miscellaneous hanzi (for personal/place names and radicals; last is
15-93).
o Rows 90 through 94 contain 470 hanzi from GB 7589-87 (94 each)
GB 8565.2-88 therefore provides a total of 8,150 characters (7,445
plus 705).
2.2.7: GB/T 12345-90
This character set is nearly identical to GB 2312-80 (see
Section 2.2.2) in terms of the number and arrangement of characters,
but simplified hanzi are replaced by their traditional versions. GB/T
12345-90 is also known as the "Supplementary Set" or GB1.
The following are some interesting facts about this character
set (some instances of simplified/traditional pairs that appear below
are actually character form differences):
o 29 vertical-use characters (punctuation and parentheses) included in
Row 6 (06-57 through 06-85).
o 2,118 traditional hanzi replace simplified hanzi in Rows 16 through
87. The "G1-Unique" appendix of the unofficial version (supplied to
the CJK-JRG for Han Unification purposes) is missing the following
four (specifies only 2,114):
0x5B3B 0x6D2F
0x5E7C 0x6F71
But, ISO 10646-1:1993 ended up getting these hanzi included anyway,
with correct mappings.
o Four simplified/traditional hanzi pairs (eight affected code points)
in rows 16 through 87 are swapped:
0x3A73 <-> 0x6161
0x5577 <-> 0x6167
0x5360 <-> 0x6245 (see the next bullet)
0x4334 <-> 0x7761
o One hanzi (0x6245), after being swapped, had its left-side radical
unsimplified (this character, now at 0x5360, is considered part of
the 2,118 traditional hanzi from the second bullet):
0x6245 -> 0x5360
o 103 hanzi included in Rows 88 (94 characters) and 89 (9 characters;
89-01 through 89-09). These are all related to characters between
Rows 16 and 87.
- 41 simplified hanzi from Rows 16 through 87 moved to Rows 88 and
89 (traditional hanzi are now at the original code points):
0x3327 -> 0x7827 0x3E5D -> 0x7846 0x4B49 -> 0x7869
0x3365 -> 0x7828 0x3F64 -> 0x7849 0x4C28 -> 0x786B
0x3373 -> 0x7829 0x402F -> 0x784B 0x4D3F -> 0x786F
0x3533 -> 0x782C 0x4030 -> 0x784C 0x4D72 -> 0x7871
0x356D -> 0x782D 0x406F -> 0x784E 0x5236 -> 0x7878
0x3637 -> 0x782F 0x4131 -> 0x7850 0x5374 -> 0x7879
0x3736 -> 0x7832 0x463B -> 0x785C 0x5438 -> 0x787C
0x3761 -> 0x7833 0x463E -> 0x785D 0x5446 -> 0x787D
0x3849 -> 0x7835 0x464B -> 0x785E 0x5622 -> 0x7921
0x3963 -> 0x7838 0x464D -> 0x785F 0x563B -> 0x7923
0x3B2E -> 0x783B 0x4653 -> 0x7860 0x5656 -> 0x7926
0x3C38 -> 0x7840 0x4837 -> 0x7866 0x567E -> 0x7928
0x3C5B -> 0x7842 0x4961 -> 0x7867 0x573C -> 0x7929
0x3C76 -> 0x7843 0x4A75 -> 0x7868
- 62 hanzi added to Rows 88 and 89 (the gaps from the above are
filled in). These were mostly to account for multiple traditional
hanzi collapsing into a single simplified form.
- The following code point mappings illustrate how all of these 103
hanzi are related to hanzi between Rows 16 and 87 (note how many
of these 103 hanzi map to a single code point):
0x7821 -> 0x305A 0x7844 -> 0x3D2A 0x7867 -> 0x4961
0x7822 -> 0x3065 0x7845 -> 0x3E21 0x7868 -> 0x4A75
0x7823 -> 0x316D 0x7846 -> 0x3E5D 0x7869 -> 0x4B49
0x7824 -> 0x3170 0x7847 -> 0x3E6D 0x786A -> 0x4B55
0x7825 -> 0x3237 0x7848 -> 0x3F4B 0x786B -> 0x4C28
0x7826 -> 0x3245 0x7849 -> 0x3F64 0x786C -> 0x4C28
0x7827 -> 0x3327 0x784A -> 0x4027 0x786D -> 0x4C28
0x7828 -> 0x3365 0x784B -> 0x402F 0x786E -> 0x4C33
0x7829 -> 0x3373 0x784C -> 0x4030 0x786F -> 0x4D3F
0x782A -> 0x3376 0x784D -> 0x405B 0x7870 -> 0x4D45
0x782B -> 0x3531 0x784E -> 0x406F 0x7871 -> 0x4D72
0x782C -> 0x3533 0x784F -> 0x407A 0x7872 -> 0x4F35
0x782D -> 0x356D 0x7850 -> 0x4131 0x7873 -> 0x4F35
0x782E -> 0x362C 0x7851 -> 0x414B 0x7874 -> 0x4F4C
0x782F -> 0x3637 0x7852 -> 0x4231 0x7875 -> 0x4F72
0x7830 -> 0x3671 0x7853 -> 0x425E 0x7876 -> 0x506B
0x7831 -> 0x3722 0x7854 -> 0x4339 0x7877 -> 0x5229
0x7832 -> 0x3736 0x7855 -> 0x4349 0x7878 -> 0x5236
0x7833 -> 0x3761 0x7856 -> 0x4349 0x7879 -> 0x5374
0x7834 -> 0x3834 0x7857 -> 0x4349 0x787A -> 0x5379
0x7835 -> 0x3849 0x7858 -> 0x4356 0x787B -> 0x5375
0x7836 -> 0x3948 0x7859 -> 0x4366 0x787C -> 0x5438
0x7837 -> 0x394E 0x785A -> 0x436F 0x787D -> 0x5446
0x7838 -> 0x3963 0x785B -> 0x3159 0x787E -> 0x5460
0x7839 -> 0x6358 0x785C -> 0x463B 0x7921 -> 0x5622
0x783A -> 0x3A7A 0x785D -> 0x463E 0x7922 -> 0x563B
0x783B -> 0x3B2E 0x785E -> 0x464B 0x7923 -> 0x563B
0x783C -> 0x3B58 0x785F -> 0x464D 0x7924 -> 0x5642
0x783D -> 0x3B63 0x7860 -> 0x4653 0x7925 -> 0x5646
0x783E -> 0x3B71 0x7861 -> 0x4727 0x7926 -> 0x5656
0x783F -> 0x3C22 0x7862 -> 0x4729 0x7927 -> 0x566C
0x7840 -> 0x3C38 0x7863 -> 0x4F4B 0x7928 -> 0x567E
0x7841 -> 0x3C52 0x7864 -> 0x476F 0x7929 -> 0x573C
0x7842 -> 0x3C5B 0x7865 -> 0x477A
0x7843 -> 0x3C76 0x7866 -> 0x4837
So, if we total everything up, we see that GB/T 12345-90 has 2,180
hanzi (2,118 are replacements for GB 2312-80 code points, and 62 are
additional) and 29 non-hanzi not found in GB 2312-80.
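The hexadecimal code points used in this section and the row-cell
positions used elsewhere are two notations for the same thing. The
following Python sketch assumes the usual ISO 2022 convention for
94-by-94 sets, in which 0x20 is added to both the row and the cell;
that convention is not spelled out in this section, and the function
names are hypothetical.

```python
# Sketch: convert between two-byte code points and row-cell positions
# for a 94-by-94 character set (assumes the ISO 2022 offset of 0x20).

def code_to_row_cell(code):
    """0x7821 -> (88, 1)"""
    hi, lo = code >> 8, code & 0xFF
    if not (0x21 <= hi <= 0x7E and 0x21 <= lo <= 0x7E):
        raise ValueError(f"outside the 94-by-94 range: 0x{code:04X}")
    return hi - 0x20, lo - 0x20


def row_cell_to_code(row, cell):
    """(79, 81) -> 0x6F71"""
    if not (1 <= row <= 94 and 1 <= cell <= 94):
        raise ValueError(f"invalid row-cell: {row}-{cell}")
    return ((row + 0x20) << 8) | (cell + 0x20)


# The 103 added hanzi run from 0x7821 through 0x7929, that is,
# all of Row 88 plus the first nine cells of Row 89.
print(code_to_row_cell(0x7821), code_to_row_cell(0x7929))

# Tallies from the text: 41 moved plus 62 added hanzi fill Rows 88
# and 89; 2,118 replacements plus 62 additions give 2,180 hanzi.
print(41 + 62, 94 + 9, 2118 + 62)
```

Under the same assumption, the hanzi at 79-81 that was corrected in
GB 6345.1-86 (see Section 2.2.3) corresponds to code point 0x6F71.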
Note that the printed GB/T 12345-90 manual has some
character-form errors. The errors I am aware of are as follows:
Code Point    Description of Error
^^^^^^^^^^    ^^^^^^^^^^^^^^^^^^^^
0x4125        The upper-left element should be "tree" instead of
              "warrior"
0x596C        The "bird" radical should not include the "fire" element
2.2.8: GB/T 13131-9X
This character set is identical to GB 7589-87 (see Section
2.2.4) in terms of number of characters, but simplified hanzi are
replaced by their traditional versions. The exact number of such
substitutions is currently unknown to this author. GB/T 13131-9X is
also known as the "Third Supplementary Set" or GB3.
2.2.9: GB/T 13132-9X
This character set is identical to GB 7590-87 (see Section
2.2.5) in terms of number of characters, but simplified hanzi are
replaced by their traditional versions. The exact number of such
substitutions is currently unknown to this author. GB/T 13132-9X is
also known as the "Fifth Supplementary Set" or GB5.
2.2.10: GB 13000.1-93
This document is, for all practical purposes, the Chinese
translation of ISO 10646-1:1993 (see Section 2.5.1).
2.2.11: ISO-IR-165:1992
This standard, also known as the CCITT Chinese Set, is a
variant of GB 2312-80 with the following characteristics:
o GB 6345.1-86 modifications (including the undocumented one) and
additions, namely 132 characters (see Section 2.2.3)
o GB 8565.2-88 additions, namely 705 characters (see Section 2.2.6)
o Row 6 contains 22 background (shading) characters (06-60 through
06-81)
o Row 12 contains 94 hanzi
o Row 13 contains 44 additional hanzi (13-51 through 13-94; fills the
row)
o Row 15 contains 1 additional hanzi (15-94)
ISO-IR-165:1992 can therefore be considered a superset of GB 2312-80,
GB 6345.1-86, and GB 8565.2-88. This means 8,443 total characters
compared to the 7,445 in GB 2312-80, 7,577 in GB 6345.1-86, and the
8,150 in GB 8565.2-88.
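The totals quoted above follow directly from the additions listed in
Sections 2.2.3, 2.2.6, and this section. The following Python sketch
simply adds them up; the variable names are hypothetical, and the only
inputs are the counts stated in the text.

```python
# Sketch: rebuild the running totals for GB 2312-80 and its variants.
gb_2312 = 7445

# GB 6345.1-86: six Pinyin characters, Row 10, and Row 11.
gb_6345_additions = 6 + 94 + 32

# GB 8565.2-88: Rows 13, 14, 15 (non-hanzi plus hanzi), and 90-94.
gb_8565_additions = 50 + 92 + (69 + 24) + 5 * 94

# ISO-IR-165:1992 only: Row 6 shading, Row 12, rest of Row 13, 15-94.
iso_ir_165_extras = 22 + 94 + 44 + 1

gb_6345_total = gb_2312 + gb_6345_additions
gb_8565_total = gb_2312 + gb_8565_additions
iso_ir_165_total = (gb_2312 + gb_6345_additions
                    + gb_8565_additions + iso_ir_165_extras)

print(gb_6345_total, gb_8565_total, iso_ir_165_total)
```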
2.2.12: OBSOLETE STANDARDS
Most GB standards seem to be revised through other documents,
so it is hard to point to a standard and claim that it is obsolete.
The only revision I am aware of is GB 1988-89 (the original was
named GB 1988-80).
2.3: CHINESE (TAIWAN)
The sections below describe two major Taiwanese character
sets, namely Big Five and CNS 11643-1992. As you will learn, they are
somewhat compatible. CCCII, also developed in Taiwan, is described in
Section 2.5.2.
2.3.1: BIG FIVE
The Big Five character set is composed of 94 rows of 157
characters each (the 157 characters of each row are encoded in an
initial group of 63 codes followed by the remaining 94 codes). The
following is a break-down of its contents:
o Row 1: 157 symbols
o Row 2: 157 symbols
o Row 3: 94 symbols
o Rows 4 through 38: 5,401 hanzi (Level 1 Hanzi; last is 38-63)
o Rows 41 through 89: 7,652 hanzi (Level 2 Hanzi; last is 89-116)
This forms what I consider to be the basic Big Five set. Two of the
hanzi in Level 2 are duplicates, however, so there are only
7,650 unique hanzi in Level 2.
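The row and cell positions above can be turned into two-byte Big Five
codes. The following Python sketch assumes the commonly used byte
values, which are not stated in this section: a first byte of 0xA1
through 0xFE for Rows 1 through 94, and a second byte of 0x40 through
0x7E for the initial group of 63 codes followed by 0xA1 through 0xFE
for the remaining 94. The function names are hypothetical.

```python
# Sketch: map a Big Five row and cell (1-157) to its two bytes.
# The byte ranges are assumptions drawn from common Big Five practice.
CELLS_PER_ROW = 157


def big5_bytes(row, cell):
    if not (1 <= row <= 94 and 1 <= cell <= CELLS_PER_ROW):
        raise ValueError(f"invalid Big Five position: {row}-{cell}")
    lead = 0xA0 + row
    if cell <= 63:
        trail = 0x40 + (cell - 1)       # initial group of 63 codes
    else:
        trail = 0xA1 + (cell - 64)      # remaining 94 codes
    return bytes([lead, trail])


def count_cells(first_row, last_row, last_cell):
    """Characters in a block whose rows are full except perhaps the last."""
    return (last_row - first_row) * CELLS_PER_ROW + last_cell


# First and last positions of each hanzi level, as hexadecimal codes.
for row, cell in [(4, 1), (38, 63), (41, 1), (89, 116)]:
    print(f"{row}-{cell}: 0x{big5_bytes(row, cell).hex().upper()}")

# The hanzi counts follow from the last positions given above.
print(count_cells(4, 38, 63), count_cells(41, 89, 116))
```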
There are two major extensions to Big Five. The first really
has no name, and can be considered part of the basic Big Five set as
specified above. It adds the following characters:
o Rows 38-39: 4 Japanese iteration marks, 83 hiragana, 86 katakana, 66
uppercase and lowercase Cyrillic (Russian) alphabet, 10 circled
digits, and 10 parenthesized digits
The other extension was developed by a company called ETen
Information System in Taiwan, and is actually considered to be the
most widely used version of Big Five. It provides the following
extensions to Big Five (different from the above extension):
o Rows 38-40: 10 circled digits, 10 parenthesized digits, 10 lowercase
Roman numerals, 25 classical radicals, 15 Japanese-specific symbols,
83 hiragana, 86 katakana, 66 uppercase and lowercase Cyrillic
(Russian) alphabet, 3 arrows, 10 radical-like hanzi elements, 40
fraction-like digits, and 7 symbols
o Row 89: 7 hanzi, 33 double-lined line-drawing elements, and a black
box
It is *very* important to note that while these two extensions
have many common portions (in particular, hiragana, katakana, the
Cyrillic alphabet, and so on), they do not share the same code points
for such characters.
Appendix G (pp 407-450) of "Developing International Software
for Windows 95 and Windows NT" by Nadine Kano illustrates the Big Five
character set standard by Big Five code (Microsoft calls this Code
Page 950). Code Page 950 incorporates some of the ETen extensions,
namely those in Row 89.
2.3.2: CNS 11643-1992
CNS 11643-1992 (also known as CNS 11643 X 5012), by
definition, consists of 16 planes of characters, seven of which have
character assignments. Each plane is a 94-row-by-94-cell matrix
capable of holding a total of 8,836 characters. CNS stands for
"Chinese National Standard."
CNS 11643-1992 specifies characters only in the first seven
planes. A break-down of characters, by plane, is as follows:
o Plane 1:
- 438 symbols in Rows 1 through 6
- 213 classical radicals in Rows 7 through 9
- 33 graphic representations of control characters in Row 34
- 5,401 hanzi in Rows 36 through 93 (last is 93-43)
o Plane 2: 7,650 hanzi in Rows 1 through 82 (last is 82-36)
o Plane 3: 6,148 hanzi in Rows 1 through 66 (last is 66-38)
o Plane 4: 7,298 hanzi in Rows 1 through 78 (last is 78-60)
o Plane 5: 8,603 hanzi in Rows 1 through 92 (last is 92-49)
o Plane 6: 6,388 hanzi in Rows 1 through 68 (last is 68-90)
o Plane 7: 6,539 hanzi in Rows 1 through 70 (last is 70-53)
The total number of characters in CNS 11643-1992 is a staggering
48,711 characters, 48,027 of which are hanzi. Also note that the number
of hanzi in Plane 1 is identical to that of Level 1 hanzi in Big Five (see
Section 2.3.1). The 2 extra hanzi in Level 2 hanzi of Big Five are
actually redundant, and are therefore not in CNS 11643-1992 Plane 2.
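The per-plane figures and the grand totals above can be recomputed
from the row ranges. The following Python sketch assumes that each
hanzi block fills its rows completely except possibly the last one;
the names used are hypothetical.

```python
# Sketch: recompute the CNS 11643-1992 hanzi and character totals.
CELLS_PER_ROW = 94


def count_cells(first_row, last_row, last_cell):
    """Characters in a block whose rows are full except perhaps the last."""
    return (last_row - first_row) * CELLS_PER_ROW + last_cell


# plane: (first row, last row, last cell) of its hanzi block
HANZI_BLOCKS = {
    1: (36, 93, 43),
    2: (1, 82, 36),
    3: (1, 66, 38),
    4: (1, 78, 60),
    5: (1, 92, 49),
    6: (1, 68, 90),
    7: (1, 70, 53),
}

# Plane 1 symbols, classical radicals, and control-character pictures.
PLANE_1_NON_HANZI = 438 + 213 + 33

hanzi_per_plane = {p: count_cells(*b) for p, b in HANZI_BLOCKS.items()}
total_hanzi = sum(hanzi_per_plane.values())
total_chars = total_hanzi + PLANE_1_NON_HANZI
capacity = len(HANZI_BLOCKS) * CELLS_PER_ROW * CELLS_PER_ROW

for plane, count in hanzi_per_plane.items():
    print(f"Plane {plane}: {count} hanzi")
print(total_hanzi, total_chars, capacity)
```

The last value printed is the capacity of seven full planes, which
shows how much room remains within the planes that have assignments.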
It is rumored that Plane 8 is currently being defined, and
will add yet more hanzi to this standard.
2.3.3: CNS 5205
This character set is Taiwan's analog to ASCII and ISO 646,
and is reportedly rarely used. How it differs from ASCII, if at all,
is unknown to this author.
2.3.4: OBSOLETE STANDARDS
CNS 11643-1986 specified characters only in the first three
planes, as described in Section 2.3.2. Also, Plane 3 of CNS 11643-1992
was called Plane 14 in CNS 11643-1986.
2.4: KOREAN
The sections below describe the most current Korean character
sets, namely KS C 5636-1993, KS C 5601-1992, KS C 5657-1991, and KS C
5700-1995. "KS" stands for "Korean Standard."
2.4.1: KS C 5636-1993
This character set (published on January 6, 1993), formerly KS
C 5636-1989 (published on April 22, 1989) and sometimes referred to as
KS-Roman, is the Korean analog to ASCII and ISO 646-1991. The primary
difference is that the ASCII backslash (0x5C) is represented as a Won
symbol.
2.4.2: KS C 5601-1992
This basic Korean character set standard enumerates 8,224
characters, 4,888 of which are hanja, and 2,350 of which are pre-
combined hangul. The hanja and hangul blocks are arranged by reading.
The following is a break-down of its contents:
o Row 1: 94 symbols
o Row 2: 69 abbreviations and symbols
o Row 3: 94 full-width KS C 5636-1993 characters (see Section 2.4.1)
o Row 4: 94 hangul elements
o Row 5: 68 lowercase and uppercase Roman numerals and lowercase and
uppercase Greek alphabet
o Row 6: 68 line-drawing elements
o Row 7: 79 abbreviations
o Row 8: 91 phonetic symbols, circled characters, and fractions
o Row 9: 94 phonetic symbols, parenthesized characters, subscripts,
and superscripts
o Row 10: 83 hiragana
o Row 11: 86 katakana
o Row 12: 66 lowercase and uppercase Cyrillic (Russian) alphabet
o Rows 16 through 40: 2,350 pre-combined hangul (last is 40-94)
o Rows 42 through 93: 4,888 hanja (last is 93-94)
Rows 41 and 94 are designated for user-defined characters.
There are many similarities with JIS X 0208-1990 and GB
2312-80, such as hiragana, katakana, Greek, and Cyrillic characters,
but they are assigned to different rows.
There is an interesting note about the hanja block (Rows 42
through 93). Although there are 4,888 hanja, not all are unique. The
hanja block is arranged by reading, and in those cases when a hanja
has more than one reading, that hanja is duplicated (sometimes more
than once) in the same character set. There are 268 such cases of
duplicate hanja in KS C 5601-1992, meaning that it contains 4,620
unique hanja. If you have a copy of the KS C 5601-1992 manual handy,
you can compare the following four code points:
0x6445
0x5162
0x5525
0x6879
While most of these cases involve two hanja instances, there are four
hanja that have three instances, and one (listed above) that has four!
This is the only CJK character set that has this property of
intentionally duplicating Chinese characters. See Section 4.4 for more
details.
Annex 3 of this standard defines the complete set of 11,172
pre-combined hangul characters, also known as Johab. Johab refers to
the encoding method, and is almost like encoding all possible three-
letter words (meaning that most are nonsense). See Section 3.3.5 for
more details on Johab encoding.
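The figure of 11,172 is the number of ways three letter positions can
be combined. The following Python sketch assumes the usual counts for
modern hangul, which are not given in this section: 19 initial
consonants, 21 vowels, and 27 final consonants plus the case of no
final consonant at all.

```python
# Sketch: where the 11,172 pre-combined hangul come from.
# The three counts below are assumptions based on modern hangul.
from itertools import product

INITIALS = 19        # initial consonants
VOWELS = 21          # medial vowels
FINALS = 27 + 1      # final consonants, plus "no final consonant"

all_syllables = INITIALS * VOWELS * FINALS

# Enumerating every (initial, vowel, final) triple gives the same count.
enumerated = sum(1 for _ in product(range(INITIALS),
                                    range(VOWELS),
                                    range(FINALS)))

# Share of the complete set covered by the 2,350 hangul of Rows 16-40.
coverage = 2350 / all_syllables

print(all_syllables, enumerated, f"{coverage:.1%}")
```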
2.4.3: KS C 5657-1991
This character set standard provides supplemental characters
for Korean writing, to include symbols, pre-combined hangul, and
hanja. The following is a break-down of its contents:
o Rows 1 through 7: 613 lowercase and uppercase Latin characters with
diacritics (see note below)
o Rows 8 through 10: 273 lowercase and uppercase Greek characters with
diacritics
o Rows 11 through 13: 275 symbols
o Row 14: 27 compound hangul elements
o Rows 16 through 36: 1,930 pre-combined hangul (last is 36-50)
o Rows 37 through 54: 1,675 pre-combined hangul (last is 54-77; see
note below)
o Rows 55 through 85: 2,856 hanja (last is 85-36)
The KS C 5657-1991 manual has a possible error (or at least an
inconsistency) for Rows 1 through 7. The manual says that there are
615 characters in that range, but I only counted 613. The difference
can be found on page 19 as the following two characters:
Character Code    Character
^^^^^^^^^^^^^^    ^^^^^^^^^
0x2137            X
0x217A            TM
An "X" doesn't belong there (it is already in KS C 5601-1992 at code
point 0x2358), and the trademark symbol is also part of KS C 5601-1992
at code point 0x2262. This is why I feel that my count of 613 is more
accurate than what is explicitly stated in the manual on page 2.
Also, page 2 of the manual says that Rows 37 through 54
contain 1,677 pre-combined hangul, but I only counted 1,675 (17 rows
of 94 characters plus a final row with 77 characters -- do the math
for yourself).
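For those who would rather let a computer do the math, the following
Python sketch repeats the calculation. It uses only the row range and
the last position (54-77) given in the list above; the variable names
are hypothetical.

```python
# Sketch: count the pre-combined hangul in Rows 37 through 54.
CELLS_PER_ROW = 94

full_rows = 54 - 37        # Rows 37 through 53 are completely filled
last_row_cells = 77        # Row 54 ends at cell 77
counted = full_rows * CELLS_PER_ROW + last_row_cells

manual_claim = 1677        # figure stated on page 2 of the manual

print(full_rows, counted, manual_claim - counted)
```

The three values printed are the number of full rows, the computed
count, and the size of the discrepancy with the manual.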
Here's another interesting note. My official copy of this
standard has all of its 2,856 hanja hand-written.
2.4.4: GB 12052-89
You may be asking yourself why a GB standard is listed under
the Korean section of this document. Well, there is a rather large
Korean population in China (Korea was considered part of China before
the 1890s), and they need a character set standard for communicating
using hangul. GB 12052-89 is a Korean character set standard
established by China (PRC), and enumerates a total of 5,979
characters.
The following is the arrangement of this character set:
o Row 1: 94 symbols
o Row 2: 72 numerals
o Row 3: 94 full-width ASCII characters
o Row 4: 83 hiragana
o Row 5: 86 katakana
o Row 6: 48 uppercase and lowercase Greek alphabet
o Row 7: 66 uppercase and lowercase Cyrillic (Russian) alphabet
o Row 8: 26 Pinyin and 37 Bopomofo characters
o Row 9: 76 line-drawing elements (09-04 through 09-79)
o Rows 16 through 37: 2,068 pre-combined hangul (Level 1 Hangul, Part
1; last is 37-94)
o Rows 38 through 52: 1,356 pre-combined hangul (Level 1 Hangul, Part
2; last is 52-40)
o Rows 53 through 71: 1,779 pre-combined hangul (Level 2 Hangul; last
is 71-87)
o Rows 71 through 72: 94 "Idu" hanja (71-89 through 72-88)
There are a few interesting notes I can make about this
character set:
o Rows 1 through 9 are identical to the same rows in GB 2312-80,
except that 03-04 is a dollar sign, not a Chinese Yuan (currency)
symbol.
o The GB 12052-89 manual states on pp 1 and 3 that Rows 53 through 72
contain 1,876 characters, but I only counted 1,873 (1,779 hangul
plus 94 hanja).
o The total number of characters, 5,979, is correctly stated in the
manual although the hangul count is incorrect.
o The arrangement and ordering of these hangul bear no relationship to
that of KS C 5601-1992. Both standards order by reading, which is
the only way in which they are similar.
I am not aware to what extent this character set is being
used (and who might be using it).
2.4.5: KS C 5700-1995
Korea has developed a new character set standard called KS C
5700-1995. It is equivalent to ISO 10646-1:1993, but has pre-combined
hangul as provided (and ordered) in Unicode Version 2.0 (meaning that
all 11,172 hangul are in a contiguous block).
2.4.6: OBSOLETE STANDARDS
KS C 5601-1986, KS C 5601-1987, and KS C 5601-1989 are the
same, character-set-wise, as KS C 5601-1992. The 1992 edition provides
more material in the form of annexes. KS C 5601-1982, the original
version, enumerated only the 51 basic hangul elements in a one-byte 7-
and 8-bit encoding. This information is still part of KS C 5601-1992,
but in Annex 4.
There were two earlier multiple-byte standards called KS C
5619-1982 and KIPS. KS C 5619-1982 enumerated 51 hangul elements,
1,316 pre-combined hangul, and 1,672 hanja. KIPS (Korean Information
Processing System) enumerated 2,058 pre-combined hangul and 2,392
hanja. Both have been rendered obsolete by KS C 5601-1987.
2.5: CJK
The only true CJK character sets available today are CCCII,
ANSI Z39.64-1989 (also known as EACC or REACC), and ISO 10646-1:1993.
ISO 10646-1:1993 is unique in that it goes beyond CJK (Chinese
characters) to provide virtually all commonly-used alphabetic scripts.
Of these three, only ISO 10646-1:1993 is expected to gain
wide-spread acceptance. CCCII and ANSI Z39.64-1989 are still used
today, but primarily for bibliographic purposes.
2.5.1: ISO 10646-1:1993
Published by ISO (International Organization for
Standardization) in Switzerland, this character set enumerates over
34,000 characters. Its I-zone ("I" stands for "Ideograph") enumerates
approximately 21,000 Chinese characters, which is the result of a
massive effort by the CJK-JRG (CJK Joint Research Group) called "Han
Unification." The CJK-JRG is now called the IRG (Ideographic
Rapporteur Group), and is off doing additional research for future
Chinese character allocations to ISO 10646-1:1993.
The Basic Multilingual Plane (BMP) of ISO 10646-1:1993 is
equivalent to Unicode. While Unicode is comprised of a single plane of
characters (which doesn't allow much room for future expansion), ISO
10646-1:1993 contains hundreds of such planes.
One very nice feature of this standard's manual is the set of
CJK code correspondence tables in Section 26 (pp 262-698). Four columns
are provided for each ISO 10646-1:1993 I-zone code point -- simplified
Chinese, traditional Chinese, Japanese, and Korean. If the ISO
10646-1:1993 Chinese character maps to one of these locales, the
hexadecimal character code, (decimal) row-cell value, and glyph for
that locale are provided. The corresponding tables in Volume 2 of "The
Unicode Standard" provide character codes (sometimes the hexadecimal
character code, and sometimes the row-cell value) and a single
glyph. Quite unfortunate. I hear that a new edition of "The Unicode
Standard" is about to be released. I hope that this problem has been
addressed.
ISO 10646-1:1993 does not replace existing national character
set standards. It simply provides a single character set that is a
superset of *most* national character sets. For example, only a
fraction of the 48,027 hanzi in CNS 11643-1992 are included in ISO
10646-1:1993. I feel that it is best to think of ISO 10646-1:1993 as
"just another character set." My philosophy is to support as many
character sets and encodings as possible.
A note about ordering this standard. If you order through ANSI
in the United States, try to get an original manual. It is not easy,
though. You see, ANSI has duplication rights for ISO documents.
Photocopying Section 26 (pp 262-698) doesn't do the Chinese characters
much justice, and some characters become hard-to-read. Unfortunately,
there is no way to indicate that you want an original ISO document
through ANSI's ordering process, so some post-ordering haggling may
become necessary.
More information on ISO 10646-1:1993 can be found at the
following URL:
http://www.unicode.org/
Japan, China (PRC), and Korea have developed their own
national standards that are based on ISO 10646-1:1993. They are
designated as JIS X 0221-1995 (see Section 2.1.4), GB 13000.1-93 (see
Section 2.2.10), and KS C 5700-1995 (see Section 2.4.5), respectively.
Note that these national-standard versions of Unicode are
aligned differently with its three versions:
Unicode Version 1.0
Unicode Version 1.1 <-> ISO 10646-1:1993, JIS X 0221-1995, GB 13000.1-93
Unicode Version 2.0 <-> KS C 5700-1995
One of the major changes made for Unicode Version 2.0 is the inclusion
of all 11,172 hangul. Version 1.1 has 6,656 hangul.
2.5.2: CCCII
The Chinese Character Analysis Group in Taiwan developed CCCII
(Chinese Character Code for Information Interchange) in the 1980s.
This character set is composed of 94 planes that have 94 rows and 94
cells (94 x 94 x 94 = 830,584 characters). Furthermore, every six
planes constitute a "layer" (6 x 94 x 94 = 53,016 characters). The
following are the contents of each of the 16 layers (the 16th layer
contains only four planes):
o Layer 1: Symbols and Traditional Chinese characters
o Layer 2: Simplified Chinese characters from PRC
o Layers 3 through 12: Variant Chinese character forms
o Layer 13: Japanese kana and kokuji (Japanese-made kanji)
o Layer 14: Korean hangul
o Layer 15: Reserved
o Layer 16: Miscellaneous characters (Japanese and Korean)
Layers 1 through 12 have a special meaning and relationship.
The same code point in these layers is designed to hold the same
character, but with different forms. Layer 1 code points contain the
traditional character forms, Layer 2 code points contain the
simplified character forms (if any), and Layers 3 through 12 contain
variant character forms (if any). For example, given a Chinese
character with three forms, its encoding and arrangement may be as
follows:
Character Form   Code Point   Layer
^^^^^^^^^^^^^^   ^^^^^^^^^^   ^^^^^
Traditional      0x224E41     1
Simplified       0x284E41     2
Variant          0x2E4E41     3
Note how the second and third bytes (0x4E41) are identical in all
three instances -- only the first byte's value, which indicates the
layer, differs. Needless to say, this method of arrangement provides
easy access to related Chinese character forms. No wonder it is used
for bibliographic purposes.
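The layer arithmetic can be sketched in a few lines of Python. This is
only an illustration: the function names are hypothetical, and I am
assuming that the plane number is the first byte minus 0x20 (so that
0x21 is Plane 1), which is consistent with the three code points in
the table above but is not taken from the CCCII manual itself:

```python
def cccii_layer(code):
    """Return the layer (1-16) of a three-byte CCCII code such as 0x224E41."""
    first_byte = (code >> 16) & 0xFF
    plane = first_byte - 0x20          # assumption: first byte 0x21 is Plane 1
    if not 1 <= plane <= 94:
        raise ValueError("first byte outside 0x21-0x7E")
    return (plane - 1) // 6 + 1        # six planes per layer


def cccii_in_layer(code, layer):
    """Return the code point holding the same character's form in another layer."""
    planes_to_move = (layer - cccii_layer(code)) * 6
    return code + (planes_to_move << 16)   # only the first byte changes


traditional = 0x224E41
print(cccii_layer(traditional))
print(hex(cccii_in_layer(traditional, 2)))   # simplified form, if any
print(hex(cccii_in_layer(traditional, 3)))   # first variant form, if any
```

Whether a given code point in Layers 2 through 12 is actually assigned
can only be answered by the CCCII tables; the sketch computes where
the related form would live, not whether it exists.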
The first layer is composed as follows:
o Plane 1/Row 2: 56 mathematical symbols
o Plane 1/Row 3: The ASCII character set
o Plane 1/Row 11: 35 Chinese punctuation marks
o Plane 1/Rows 12 through 14: 214 classical radicals
o Plane 1/Row 15: 41 Chinese numerical symbols, 37 phonetic symbols,
and 4 tone marks
o Plane 1/Rows 16 through 67: 4,808 common Chinese characters
o Plane 1/Row 68 through Plane 3/Row 64: 17,032 less common Chinese
characters
o Plane 3/Row 65 through Plane 6/Row 5: 20,583 rare Chinese characters
Note that Row 1 of all planes is reserved, and never assigned
characters. Take this into account when studying the above table
ranges that span planes (that is, skip Row 1).
In addition to the above, there are 11,517 simplified Chinese
characters in Layer 2 (3,625 are considered PRC simplified forms, and
the remaining 7,892 are regular simplified forms). This provides a
total of 53,940 Chinese characters.
Further information on CCCII (to include very interesting
historical notes) can be found on pp 146-149 of John Clews' "Language
Automation Worldwide: The Development of Character Set Standards" and
Chapter 6 of Huang & Huang's "An Introduction to Chinese, Japanese,
and Korean Computing."
2.5.3: ANSI Z39.64-1989
This national standard is designated as ANSI Z39.64-1989 and
named "East Asian Character Code" (EACC), but was originally known as
REACC (RLIN East Asian Character Code), that is, before it became a
national standard. RLIN stands for "Research Libraries Information
Network," which was developed by the Research Libraries Group (RLG)
located in Mountain View, California.
RLG's Home Page is at the following URL:
http://www.rlg.org/
The structure of ANSI Z39.64-1989 is based on CCCII, but with
a few differences. Many consider it to be superior to and a
replacement for CCCII (see Section 2.5.2).
The ANSI Z39.64-1989 standard is available through ANSI, but
you should be aware that it is distributed in the form of several
microfiche. Not a terribly useful storage medium these days. I had my
set transformed into tangible printed pages. You can also obtain this
standard through NISO (National Information Standards Organization)
Press Fulfillment. Their URL is:
http://www.niso.org/
EACC has been designated by the Library of Congress as a
character set for use in USMARC (United States MAchine-Readable
Cataloging) records, and is used extensively by East Asian libraries
across North America.
EACC is also being used in Australia for the National CJK
Project. Check out the following URL for more details:
http://www.nla.gov.au/1/asian/ncjk/cjkhome.html
Further information on ANSI Z39.64-1989 (to include very
interesting historical notes) can be found on pp 150-156 of John
Clews' "Language Automation Worldwide: The Development of Character
Set Standards" (although a source at RLG tells me that some of Clews'
facts are wrong) and Chapter 6 of Huang & Huang's "An Introduction to
Chinese, Japanese, and Korean Computing."
The authoritative paper on EACC is "RLIN East Asian Character
Code and the RLIN CJK Thesaurus" by Karen Smith Yoshimura and Alan
Tucker, published in "Proceedings of the Second Asian-Pacific
Conference on Library Science," May 20-24,1985, Seoul, Korea.
2.6: OTHER
This section includes character set standards that don't
properly fall under the above sections.
2.6.1: GB 8045-87
GB 8045-87 is a Mongolian character set standard established
by China (PRC). This standard enumerates 94 Mongolian characters. Of
these 94 characters, 12 are punctuation (vertically-oriented), and the
remaining 82 are characters specific to the Mongolian script.
Mongolian is written vertically like Chinese.
I do not discuss the encoding for GB 8045-87 in Part 3, so I
will do so here. The GB 8045-87 manual describes a 7- and 8-bit
encoding. The 7-bit encoding puts these 94 characters in the standard
ASCII printable range, namely 0x21 through 0x7E. Code point 0x20 is
marked as "MSP" which stands for "Mongolian space." The 8-bit encoding
puts these 94 characters in the range 0xA1 through 0xFE, with the
"MSP" character at code point 0xA0. The GB 1988-89 set is then encoded
in the range 0x21 through 0x7E.
2.6.2: TCVN-5773:1993
TCVN-5773:1993 (also called NSCII, which is short for Nom
Standard Code for Information Interchange) is the Vietnamese analog to
ISO 10646-1:1993, but adds 1,775 Vietnamese-specific Chinese
characters. These 1,775 characters are encoded in the range 0xA000
through 0xA6EE.
More information on TCVN-5773:1993 can be found at the
following URL:
ftp://unicode.org/pub/MappingTables/EastAsiaMaps/
There are two files at the above URL that pertain to this standard.
The first is a README, and the second is a Macintosh HyperCard stack
(requires HyperCard):
TCVN-NSCII.README
TCVN-NSCIIstack_1.0.sea.hqx
PART 3: CJK ENCODING SYSTEMS
These sections describe the various systems for encoding the
character set standards listed in Part 2. The first two described,
7-bit ISO 2022 and EUC, are not specific to a locale, and in some
cases not specific to CJK.
The CJK Character Set Server at the following URL can generate
character sets based on encodings described in this section:
http://jasper.ora.com/lunde/cjk-char.html
I suggest that you use this as a way to obtain files that illustrate
these encodings in action.
But first, please take a peek at the following table, which is
an attempt to illustrate how two Chinese characters (that stand for
"kanji/hanzi/hanja") are encoded using the various methods presented
in the following sections (character codes as hexadecimal digits, and
escape sequences or shift sequences as printable characters):
o Japanese (JIS X 0208-1990 & JIS X 0201-1976):
  - 7-bit ISO 2022     <ESC> & @ <ESC> $ B 0x3441 0x3B7A <ESC> ( J
  - ISO-2022-JP        <ESC> $ B 0x3441 0x3B7A <ESC> ( J
  - EUC                0xB4C1 0xBBFA
  - Shift-JIS          0x8ABF 0x8E9A
o Simplified Chinese (GB 2312-80 & GB 1988-89 or ASCII):
  - 7-bit ISO 2022     <ESC> $ A 0x3A3A 0x5756 <ESC> ( T
  - ISO-2022-CN        <ESC> $ ) A <SO> 0x3A3A 0x5756 <SI>
  - EUC                0xBABA 0xD7D6
  - HZ (HZ-GB-2312)    ~{ 0x3A3A 0x5756 ~}
  - zW                 zW 0x3A3A 0x5756
o Traditional Chinese (CNS 11643-1992):
  - 7-bit ISO 2022     <ESC> $ ( G 0x6947 0x4773 <ESC> ( B
  - ISO-2022-CN        <ESC> $ ) G <SO> 0x6947 0x4773 <SI>
  - EUC                0xE9C7 0xC7F3 or 0x8EA1E9C7 0x8EA1C7F3
o Traditional Chinese (Big Five):
  - Big Five           0xBA7E 0xA672
o Korean (KS C 5601-1992 & ASCII):
  - 7-bit ISO 2022     <ESC> $ ( C 0x7953 0x6D2E <ESC> ( B
  - ISO-2022-KR        <ESC> $ ) C <SO> 0x7953 0x6D2E <SI>
  - EUC                0xF9D3 0xEDAE
  - Johab              0xF7D3 0xF1AE
o CJK (ISO 10646-1:1993, JIS X 0221-1995, GB 13000.1-93, or KS C
  5700-1995):
  - UCS-2              0x6F22 0x5B57
  - UCS-4              0x00006F22 0x00005B57
The above should have given you a taste of what information the
following sections provide.
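If you have a Python interpreter handy, several rows of the above
table can be reproduced with its built-in codecs. This is just a
sketch: the codec names ("euc_jp", "gb2312", and so on) are Python's
own labels, not names used by the standards, and the helper
encoded_hex is hypothetical. Note that the GB 2312-80 rows use the
simplified forms of the two characters, so a different string is
encoded there:

```python
def encoded_hex(text, codec):
    """Encode text with one of Python's built-in codecs; return uppercase hex."""
    return text.encode(codec).hex().upper()


KANJI = "\u6f22\u5b57"   # traditional forms (Japanese, Big Five, Korean, UCS)
HANZI = "\u6c49\u5b57"   # simplified forms (GB 2312-80)

for label, codec, text in [
    ("EUC (Japanese)",   "euc_jp",    KANJI),
    ("Shift-JIS",        "shift_jis", KANJI),
    ("EUC (GB 2312-80)", "gb2312",    HANZI),
    ("Big Five",         "big5",      KANJI),
    ("EUC (Korean)",     "euc_kr",    KANJI),
    ("UCS-2",            "utf_16_be", KANJI),   # same bytes as UCS-2 here
    ("UCS-4",            "utf_32_be", KANJI),
]:
    print(f"{label:18} {encoded_hex(text, codec)}")
```

The escape-sequence-based encodings are left out on purpose, because a
codec is free to choose which one-byte set it returns to at the end of
the text, and so its output need not match the table byte for byte.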
3.1: 7-BIT ISO 2022 ENCODING
7-bit ISO 2022 is the name commonly given to the encoding
system that uses escape sequences to shift between character sets.
(ISO 2022 encoded Japanese text is also known as "JIS" encoding, but
is different from ISO-2022-JP and ISO-2022-JP-2, and will be explained
in Section 3.1.3.) This encoding comes from the ISO 2022-1993
standard.
An escape sequence, as the name implies, consists of an escape
character followed by a sequence of one or more characters. These
escape sequences are used to change the character set of the text
stream. This may also mean a shift from one- to two-byte-per-character
mode (or vice versa).
7-bit ISO 2022 character sets fall into two types: one-byte
and two-byte. CJK character sets, for obvious reasons, fall into the
latter group.
One advantage that 7-bit ISO 2022 encoding has over other
encoding systems is that its escape sequences specify the character
set, and thus the locale. 7-bit ISO 2022 encoding also encodes
text using only seven-bit bytes, which has the benefit of being able
to survive Internet travel (e-mail).
3.1.1: CODE SPACE
Each byte in the representation of graphic (printable)
characters falls into the range 0x21 (decimal 33) through 0x7E (decimal
126). For one-byte character sets, this means a maximum of 94
characters. For two-byte character sets, this means a maximum of 8,836
characters (94 x 94 = 8,836).
One-byte Characters   Encoding Range
^^^^^^^^^^^^^^^^^^^   ^^^^^^^^^^^^^^
first byte range      0x21-0x7E
Two-byte Characters   Encoding Range
^^^^^^^^^^^^^^^^^^^   ^^^^^^^^^^^^^^
first byte range      0x21-0x7E
second byte range     0x21-0x7E
White space and control characters (of which the "escape" character is
one) are still found in 0x00-0x20 and 0x7F.
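The relationship between a (decimal) row-cell value and its two-byte
7-bit ISO 2022 code is simply an offset of 0x20 on each byte. The
following Python sketch shows the conversion in both directions; the
function names are made up for this example:

```python
def rowcell_to_iso2022(row, cell):
    """Convert a row-cell value (each 1-94) to its two-byte 7-bit code."""
    if not (1 <= row <= 94 and 1 <= cell <= 94):
        raise ValueError("row and cell must each be in the range 1-94")
    return bytes([row + 0x20, cell + 0x20])


def iso2022_to_rowcell(code):
    """Convert a two-byte 7-bit code (each byte 0x21-0x7E) to (row, cell)."""
    if len(code) != 2 or not all(0x21 <= b <= 0x7E for b in code):
        raise ValueError("expected two bytes in the range 0x21-0x7E")
    return code[0] - 0x20, code[1] - 0x20


print(rowcell_to_iso2022(20, 33).hex().upper())       # row 20, cell 33
print(iso2022_to_rowcell(bytes([0x3B, 0x7A])))
print(94 * 94)                                        # size of a two-byte set
```

Row 20, cell 33 comes out as 0x3441, which is the first of the two
JIS X 0208-1990 codes used as an example at the beginning of Part 3.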
3.1.2: ISO-REGISTERED ESCAPE SEQUENCES
The following is a table that provides the ISO-registered
escape sequences for various one- and two-byte character sets
mentioned in Part 2 of this document (ISO registration numbers
provided in the fourth column):
One-byte Character Set   Escape Sequence       Hexadecimal      ISO Reg
^^^^^^^^^^^^^^^^^^^^^^   ^^^^^^^^^^^^^^^       ^^^^^^^^^^^      ^^^^^^^
ASCII (ANSI X3.4-1986)   <ESC> ( B             0x1B2842         6
Half-width katakana      <ESC> ( I             0x1B2849         13
JIS X 0201-1976 Roman    <ESC> ( J             0x1B284A         14
GB 1988-89 Roman         <ESC> ( T             0x1B2854         57
Two-byte Character Set   Escape Sequence       Hexadecimal      ISO Reg
^^^^^^^^^^^^^^^^^^^^^^   ^^^^^^^^^^^^^^^       ^^^^^^^^^^^      ^^^^^^^
JIS C 6226-1978          <ESC> $ @             0x1B2440         42
GB 2312-80               <ESC> $ A             0x1B2441         58
JIS X 0208-1983          <ESC> $ B             0x1B2442         87
KS C 5601-1992           <ESC> $ ( C           0x1B242843       149
JIS X 0212-1990          <ESC> $ ( D           0x1B242844       159
ISO-IR-165:1992          <ESC> $ ( E           0x1B242845       165
JIS X 0208-1990          <ESC> & @ <ESC> $ B   0x1B26401B2442   168
CNS 11643-1992 Plane 1   <ESC> $ ( G           0x1B242847       171
CNS 11643-1992 Plane 2   <ESC> $ ( H           0x1B242848       172
CNS 11643-1992 Plane 3   <ESC> $ ( I           0x1B242849       183
CNS 11643-1992 Plane 4   <ESC> $ ( J           0x1B24284A       184
CNS 11643-1992 Plane 5   <ESC> $ ( K           0x1B24284B       185
CNS 11643-1992 Plane 6   <ESC> $ ( L           0x1B24284C       186
CNS 11643-1992 Plane 7   <ESC> $ ( M           0x1B24284D       187
Note that the first three two-byte character sets (and JIS X
0208-1990, which reuses the JIS X 0208-1983 sequence) do not use an
opening parenthesis (0x28 or "(") in their escape sequences, which
means that they don't follow the 7-bit ISO 2022 rules precisely. They
are shorter for historical reasons, and are retained for backwards
compatibility.
Also note that not all of the CJK character set standards described in
Part 2 have ISO-registered escape sequences.
There are other encoding methods that are similar to 7-bit ISO
2022 in that they are suitable for Internet use, but are locale-
specific. These include HZ and zW encoding, both of which are specific
to the GB 2312-80 character set (see Sections 3.3.2 and 3.3.3).
ISO-2022-JP, ISO-2022-KR, ISO-2022-CN, and ISO-2022-CN-EXT are
described below.
3.1.3: ISO-2022-JP AND ISO-2022-JP-2
ISO-2022-JP is best described as a subset of 7-bit ISO 2022
encoding for Japanese, and reflects how Japanese text is encoded for
e-mail messages. ISO-2022-JP-2 is an extension that supports
additional character sets.
There are only four escape sequences permitted in ISO-2022-JP,
indicated as follows:
One-byte Character Set Escape Sequence Hexadecimal
^^^^^^^^^^^^^^^^^^^^^^ ^^^^^^^^^^^^^^^ ^^^^^^^^^^^
ASCII (ANSI X3.4-1986) <ESC> ( B 0x1B2842
JIS X 0201-1976 Roman <ESC> ( J 0x1B284A
Two-byte Character Set Escape Sequence Hexadecimal
^^^^^^^^^^^^^^^^^^^^^^ ^^^^^^^^^^^^^^^ ^^^^^^^^^^^
JIS C 6226-1978 <ESC> $ @ 0x1B2440
JIS X 0208-1983 <ESC> $ B 0x1B2442
Note the lack of JIS X 0208-1990, JIS X 0212-1990, and half-width
katakana escape sequences. The JIS X 0208-1983 escape sequence is used
to indicate both JIS X 0208-1983 and JIS X 0208-1990 (for practical
reasons).
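To show how the four escape sequences drive a decoder, here is a
minimal Python sketch. It is not a complete ISO-2022-JP
implementation. The names ESCAPES and decode_iso2022jp are
hypothetical, JIS X 0201-1976 Roman is treated as if it were ASCII
(an assumption that ignores the yen sign and overline), and two-byte
codes are decoded by setting the high bit of each byte and handing the
result to Python's "euc_jp" codec:

```python
# The four permitted escape sequences, mapped to bytes per character.
ESCAPES = {
    b"\x1b(B": 1,   # ASCII
    b"\x1b(J": 1,   # JIS X 0201-1976 Roman (treated as ASCII in this sketch)
    b"\x1b$@": 2,   # JIS C 6226-1978
    b"\x1b$B": 2,   # JIS X 0208-1983
}


def decode_iso2022jp(data):
    """Decode ISO-2022-JP bytes to a string (simplified sketch)."""
    out = []
    width = 1                        # text starts in one-byte mode
    i = 0
    while i < len(data):
        if data[i] == 0x1B:          # <ESC> introduces a three-byte sequence
            seq = data[i:i + 3]
            if seq not in ESCAPES:
                raise ValueError(f"unsupported escape sequence {seq!r}")
            width = ESCAPES[seq]
            i += 3
        elif width == 2 and 0x21 <= data[i] <= 0x7E:
            pair = data[i:i + 2]
            if len(pair) < 2:
                raise ValueError("truncated two-byte character")
            # Setting the high bits gives the EUC code set 1 form.
            out.append(bytes(b | 0x80 for b in pair).decode("euc_jp"))
            i += 2
        else:                        # one-byte character or control code
            out.append(chr(data[i]))
            i += 1
    return "".join(out)


print(decode_iso2022jp(b"abc \x1b$B4A;z\x1b(J def"))
```

The bytes "4A" and ";z" in the sample are 0x3441 and 0x3B7A, the same
two codes shown in the table at the beginning of Part 3.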
ISO-2022-JP-2 permits additional escape sequences, indicated
as follows:
One-byte Character Set Escape Sequence Hexadecimal
^^^^^^^^^^^^^^^^^^^^^^ ^^^^^^^^^^^^^^^ ^^^^^^^^^^^
ASCII (ANSI X3.4-1986) <ESC> ( B 0x1B2842
JIS X 0201-1976 Roman <ESC> ( J 0x1B284A
Two-byte Character Set Escape Sequence Hexadecimal
^^^^^^^^^^^^^^^^^^^^^^ ^^^^^^^^^^^^^^^ ^^^^^^^^^^^
JIS C 6226-1978 <ESC> $ @ 0x1B2440
JIS X 0208-1983 <ESC> $ B 0x1B2442
JIS X 0212-1990 <ESC> $ ( D 0x1B242844
GB 2312-80 <ESC> $ A 0x1B2441
KS C 5601-1992 <ESC> $ ( C 0x1B242843
With the introduction of ISO-2022-KR (see Section 3.1.4), ISO-2022-CN
(see Section 3.1.5), and ISO-2022-CN-EXT (see Section 3.1.5), the
usefulness of supporting GB 2312-80 and KS C 5601-1992 can be
questioned. However, ISO-2022-JP-2 provides support for JIS X
0212-1990.
More detailed information on ISO-2022-JP encoding can be found
in RFC 1468. And, more detailed information on ISO-2022-JP-2 encoding
can be found in RFC 1554.
3.1.4: ISO-2022-KR
ISO-2022-KR is similar to ISO-2022-JP (see Section 3.1.3) in
that it reflects how Korean text is encoded for e-mail messages.
However, its actual implementation is a bit different. Below is a
summary.
There are only two shift sequences used in ISO-2022-KR,
indicated as follows:
One-byte Character Set Shift Sequence Hexadecimal
^^^^^^^^^^^^^^^^^^^^^^ ^^^^^^^^^^^^^^ ^^^^^^^^^^^
ASCII (ANSI X3.4-1986) <SI> 0x0F
Two-byte Character Set Shift Sequence Hexadecimal
^^^^^^^^^^^^^^^^^^^^^^ ^^^^^^^^^^^^^^ ^^^^^^^^^^^
KS C 5601-1992 <SO> 0x0E
Furthermore, the following designator sequence must appear only once,
at the beginning of a line, before any KS C 5601-1992 characters (this
usually means that it appears by itself on the first line of the
file):
<ESC> $ ) C 0x1B242943
It almost looks the same as the KS C 5601-1992 escape sequence in
7-bit ISO 2022, but look again. The opening parenthesis (0x28 or "(")
is replaced by a closing parenthesis (0x29 or ")"). This designator
sequence serves a different purpose than an escape sequence. It is
like a flag indicating that "this document contains KS C 5601-1992
characters." The <SO> and <SI> control characters actually perform the
switching between one- (ASCII) and two-byte (KS C 5601-1992) codes.
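The division of labor between the designator sequence and the <SO>
and <SI> characters can be sketched in Python as follows. Again, this
is an illustration under assumptions, not a full implementation: the
function name is hypothetical, the designator is simply skipped
wherever it is found rather than being checked for position, and
two-byte codes are decoded by setting their high bits and using
Python's "euc_kr" codec:

```python
DESIGNATOR = b"\x1b$)C"     # "this document contains KS C 5601-1992"
SO, SI = 0x0E, 0x0F         # shift out (two-byte) and shift in (ASCII)


def decode_iso2022kr(data):
    """Decode ISO-2022-KR bytes to a string (simplified sketch)."""
    out = []
    two_byte = False
    i = 0
    while i < len(data):
        if data.startswith(DESIGNATOR, i):
            i += len(DESIGNATOR)     # a flag only; it does not switch modes
        elif data[i] == SO:
            two_byte = True
            i += 1
        elif data[i] == SI:
            two_byte = False
            i += 1
        elif two_byte and 0x21 <= data[i] <= 0x7E:
            pair = data[i:i + 2]
            if len(pair) < 2:
                raise ValueError("truncated two-byte character")
            out.append(bytes(b | 0x80 for b in pair).decode("euc_kr"))
            i += 2
        else:
            out.append(chr(data[i]))
            i += 1
    return "".join(out)


print(decode_iso2022kr(b"\x1b$)C\nhanja: \x0eySm.\x0f\n"))
```

The bytes "yS" and "m." are 0x7953 and 0x6D2E, the KS C 5601-1992
codes from the table at the beginning of Part 3.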
More detailed information on ISO-2022-KR encoding can be found
in RFC 1557.
3.1.5: ISO-2022-CN AND ISO-2022-CN-EXT
ISO-2022-CN and ISO-2022-CN-EXT are similar to ISO-2022-JP
(see Section 3.1.3) and ISO-2022-KR (see Section 3.1.4) in that they
reflect how Chinese text is encoded for e-mail messages.
As with ISO-2022-KR, there are only two shift sequences,
indicated as follows:
One-byte Character Set Shift Sequence Hexadecimal
^^^^^^^^^^^^^^^^^^^^^^ ^^^^^^^^^^^^^^ ^^^^^^^^^^^
ASCII (ANSI X3.4-1986) <SI> 0x0F
Two-byte Character Set Shift Sequence Hexadecimal
^^^^^^^^^^^^^^^^^^^^^^ ^^^^^^^^^^^^^^ ^^^^^^^^^^^
<Too Many to List> <SO> 0x0E
But, unlike ISO-2022-KR, there are single shift sequences. Single
shift means that they are used before every (single) character, not
before sequences of characters.
Single Shift Type   Shift Sequence        Hexadecimal
^^^^^^^^^^^^^^^^^   ^^^^^^^^^^^^^^        ^^^^^^^^^^^
SS2                 <ESC> N               0x1B4E
SS3                 <ESC> O (not zero!)   0x1B4F
ISO-2022-CN supports the following character sets using SO and
SS2 designations:
Character Set            Type   Designation Sequence   Hexadecimal
^^^^^^^^^^^^^            ^^^^   ^^^^^^^^^^^^^^^^^^^^   ^^^^^^^^^^^
GB 2312-80               SO     <ESC> $ ) A            0x1B242941
CNS 11643-1992 Plane 1   SO     <ESC> $ ) G            0x1B242947
CNS 11643-1992 Plane 2   SS2    <ESC> $ * H            0x1B242A48
Each designator sequence must appear once on a line before any
instance of the character set it designates. If two lines contain
characters from the same character set, both lines must include the
designator sequence (this is so the text can be displayed correctly
when scrolling back in a window). This is different behavior from
ISO-2022-KR where the designator sequence appears once in the entire
file (this is because ISO-2022-KR supports a single two-byte character
set).
ISO-2022-CN-EXT supports the following character sets using
SO, SS2, and SS3 designations (notice how ISO-2022-CN is still
supported in the same manner):
Character Set            Type   Designation Sequence   Hexadecimal
^^^^^^^^^^^^^            ^^^^   ^^^^^^^^^^^^^^^^^^^^   ^^^^^^^^^^^
GB 2312-80               SO     <ESC> $ ) A            0x1B242941
GB/T 12345-90            SO     NOT REGISTERED
ISO-IR-165               SO     <ESC> $ ) E            0x1B242945
CNS 11643-1992 Plane 1   SO     <ESC> $ ) G            0x1B242947
CNS 11643-1992 Plane 2   SS2    <ESC> $ * H            0x1B242A48
GB 7589-87               SS2    NOT REGISTERED
GB/T 13131-9X            SS2    NOT REGISTERED
CNS 11643-1992 Plane 3   SS3    <ESC> $ + I            0x1B242B49
CNS 11643-1992 Plane 4   SS3    <ESC> $ + J            0x1B242B4A
CNS 11643-1992 Plane 5   SS3    <ESC> $ + K            0x1B242B4B
CNS 11643-1992 Plane 6   SS3    <ESC> $ + L            0x1B242B4C
CNS 11643-1992 Plane 7   SS3    <ESC> $ + M            0x1B242B4D
GB 7590-87               SS3    NOT REGISTERED
GB/T 13132-9X            SS3    NOT REGISTERED
Support for character sets indicated as NOT REGISTERED will be added
once they are ISO-registered.
More detailed information on ISO-2022-CN and ISO-2022-CN-EXT
encodings can be found in RFC 1922.
3.2: EUC ENCODING
EUC stands for "Extended UNIX Code," and is a rich encoding
system from ISO 2022-1993 that is designed to handle large or multiple
character sets. It is primarily used on UNIX systems, such as Sun's
Solaris.
EUC consists of four code sets, numbered 0 through 3. The
only code set that is more or less fixed by definition is code set 0,
which is specified to contain ASCII or a locale's equivalent (such as
JIS X 0201-1976 for Japanese or GB 1988-89 for PRC Chinese).
It is quite common to append the locale name to "EUC" when
designating a specific instance of EUC encoding. Common designations
include EUC-JP, EUC-CN, EUC-KR, and EUC-TW.
3.2.1: JAPANESE REPRESENTATION
The following table illustrates the Japanese representation of
EUC packed format:
EUC Code Sets Encoding Range
^^^^^^^^^^^^^ ^^^^^^^^^^^^^^
Code set 0 (ASCII or JIS X 0201-1976 Roman): 0x21-0x7E
Code set 1 (JIS X 0208): 0xA1A1-0xFEFE
Code set 2 (half-width katakana): 0x8EA1-0x8EDF
Code set 3 (JIS X 0212-1990): 0x8FA1A1-0x8FFEFE
An earlier version of EUC for Japanese used code set 3 as the user-
defined range.
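Because the first byte of each character identifies its code set, a
stream of Japanese EUC text can be split into characters with a simple
loop. The following Python sketch does just that; the generator name
split_euc_jp is hypothetical, and it only checks lead bytes, so it
should not be mistaken for a validator:

```python
def split_euc_jp(data):
    """Yield (code_set, raw_bytes) for each character in Japanese EUC data."""
    i = 0
    while i < len(data):
        lead = data[i]
        if lead == 0x8E:                 # SS2: half-width katakana
            code_set, size = 2, 2
        elif lead == 0x8F:               # SS3: JIS X 0212-1990
            code_set, size = 3, 3
        elif 0xA1 <= lead <= 0xFE:       # JIS X 0208
            code_set, size = 1, 2
        else:                            # ASCII/JIS-Roman and control codes
            code_set, size = 0, 1
        chunk = data[i:i + size]
        if len(chunk) < size:
            raise ValueError("truncated character at end of data")
        yield code_set, chunk
        i += size


sample = b"A" + bytes.fromhex("B4C1") + bytes.fromhex("8EB1") + bytes.fromhex("8FB0A1")
for code_set, raw in split_euc_jp(sample):
    print(code_set, raw.hex().upper())
```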
3.2.2: CHINESE (PRC) REPRESENTATION
The following table illustrates the Chinese (PRC)
representation of EUC packed format:
EUC Code Sets Encoding Range
^^^^^^^^^^^^^ ^^^^^^^^^^^^^^
Code set 0 (ASCII or GB 1988-89): 0x21-0x7E
Code set 1 (GB 2312-80): 0xA1A1-0xFEFE
Code set 2: unused
Code set 3: unused
Note how code sets 2 and 3 are unused.
The encoding used on Macintosh is quite similar, but has a
shortened two-byte range (0xA1A1 through 0xFCFE) plus additional
one-byte code points, namely 0x80 ("u" with dieresis), 0xFD
("copyright" symbol: "c" in a circle), 0xFE ("trademark" symbol: "TM"
as a superscript), and 0xFF ("ellipsis" symbol: three dots).
3.2.3: CHINESE (TAIWAN) REPRESENTATION
The following table illustrates the Chinese (Taiwan)
representation of EUC packed format:
EUC Code Sets Encoding Range
^^^^^^^^^^^^^ ^^^^^^^^^^^^^^
Code set 0 (ASCII): 0x21-0x7E
Code set 1 (CNS 11643-1992 Plane 1): 0xA1A1-0xFEFE
Code set 2 (CNS 11643-1992 Planes 1-16): 0x8EA1A1A1-0x8EB0FEFE
Code set 3: unused
Note how CNS 11643-1992 Plane 1 is redundantly encoded in code set 1
(two-byte) and code set 2 (four-byte). The second byte of code set 2
indicates the plane number. For example, 0xA1 is Plane 1 and so on up
until 0xB0, which is Plane 16.
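As a small illustration of how the plane number is recovered, the
following Python sketch turns either form of a character into a
(plane, row, cell) triple. The function name parse_euc_tw is my own,
and the sketch only checks the byte ranges given in the table above:

```python
def parse_euc_tw(char):
    """Return (plane, row, cell) for one two- or four-byte EUC-TW character."""
    if len(char) == 2:                       # code set 1: always Plane 1
        plane, hi, lo = 1, char[0], char[1]
    elif len(char) == 4 and char[0] == 0x8E and 0xA1 <= char[1] <= 0xB0:
        plane = char[1] - 0xA0               # 0xA1 is Plane 1 ... 0xB0 is Plane 16
        hi, lo = char[2], char[3]
    else:
        raise ValueError("not a code set 1 or code set 2 character")
    if not (0xA1 <= hi <= 0xFE and 0xA1 <= lo <= 0xFE):
        raise ValueError("row/cell bytes must be in the range 0xA1-0xFE")
    return plane, hi - 0xA0, lo - 0xA0


# Both encodings of the same Plane 1 character give the same answer.
print(parse_euc_tw(bytes.fromhex("E9C7")))
print(parse_euc_tw(bytes.fromhex("8EA1E9C7")))
```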
3.2.4: KOREAN REPRESENTATION
The following table illustrates the Korean representation of
EUC packed format (this is also known as "Wansung" encoding -- the
Korean word "wansung" means "pre-compose"):
EUC Code Sets Encoding Range
^^^^^^^^^^^^^ ^^^^^^^^^^^^^^
Code set 0 (ASCII or KS C 5636-1993): 0x21-0x7E
Code set 1 (KS C 5601-1992): 0xA1A1-0xFEFE
Code set 2: unused
Code set 3: unused
Note how code sets 2 and 3 are unused.
The encoding used on Macintosh is quite similar, but has a
shortened two-byte range (0xA1A1 through 0xFDFE) plus additional
one-byte code points, namely 0x81 ("won" symbol), 0x82 (hyphen), 0x83
("copyright" symbol: "c" in a circle), 0xFE ("trademark" symbol: "TM"
as a superscript), and 0xFF ("ellipsis" symbol: three dots).
See Section 3.3.17 for a description of Microsoft's extension
to this encoding, called Unified Hangul Code.
3.3: LOCALE-SPECIFIC ENCODINGS
The encoding systems described in the following sections are
considered to be locale-specific, namely that they are used to encode a
specific character set standard. This is not to say that they are not
widely used (actually, some of these are among the most widely used
encoding systems!), but rather that they are tied to a specific
character set.
3.3.1: SHIFT-JIS
Shift-JIS (also known as MS Kanji, SJIS, or DBCS-PC) is the
encoding system used on machines that support MS-DOS or Windows, and
also for Macintosh (KanjiTalk or Japanese Language Kit). It was
originally developed by Microsoft Corporation as a way to support the
Japanese character set on MS-DOS. The following tables provide the
Shift-JIS encoding ranges:
Two-byte Standard Characters Encoding Ranges
^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ^^^^^^^^^^^^^^^
first byte ranges 0x81-0x9F, 0xE0-0xEF
second byte ranges 0x40-0x7E, 0x80-0xFC
Two-byte User-defined Characters Encoding Ranges
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ^^^^^^^^^^^^^^^
first byte range 0xF0-0xFC
second byte ranges 0x40-0x7E, 0x80-0xFC
One-byte Characters Encoding Range
^^^^^^^^^^^^^^^^^^^ ^^^^^^^^^^^^^^
Half-width katakana 0xA1-0xDF
ASCII/JIS-Roman 0x21-0x7E
It is important to note that the user-defined range does not
correspond to code points in other encodings that support Japanese,
such as 7-bit ISO 2022 or EUC. This is a portability problem.
Shift-JIS is also unique in that it does not support the JIS X
0212-1990 character set standard.
The encoding used on Macintosh is quite similar to the above
table, but has additional one-byte code points, namely 0x80
(backslash), 0xFD ("copyright" symbol: "c" in a circle), 0xFE
("trademark" symbol: "TM" as a superscript), and 0xFF ("ellipsis"
symbol: three dots).
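The two-byte standard ranges above are the result of folding the 94
rows of JIS X 0208 into 47 first-byte values, with two rows sharing
each first byte. The following Python sketch shows one common way to
write the conversion from a two-byte 7-bit code to Shift-JIS. It is
not taken from any standard's text, and the function name jis_to_sjis
is hypothetical:

```python
def jis_to_sjis(j1, j2):
    """Convert a two-byte 7-bit JIS code (each byte 0x21-0x7E) to Shift-JIS."""
    if not (0x21 <= j1 <= 0x7E and 0x21 <= j2 <= 0x7E):
        raise ValueError("each byte must be in the range 0x21-0x7E")
    row, cell = j1 - 0x20, j2 - 0x20
    s1 = (row - 1) // 2 + 0x81       # two rows share each first byte
    if s1 > 0x9F:
        s1 += 0x40                   # skip 0xA0-0xDF (half-width katakana)
    if row % 2:                      # odd row: second byte 0x40-0x9E
        s2 = cell + 0x3F
        if s2 >= 0x7F:
            s2 += 1                  # 0x7F is never used as a second byte
    else:                            # even row: second byte 0x9F-0xFC
        s2 = cell + 0x9E
    return bytes([s1, s2])


print(jis_to_sjis(0x34, 0x41).hex().upper())
print(jis_to_sjis(0x3B, 0x7A).hex().upper())
```

The two sample codes are those from the table at the beginning of
Part 3, and the results match the Shift-JIS row there (0x8ABF and
0x8E9A).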
3.3.2: HZ (HZ-GB-2312)
HZ is a simple yet very powerful and reliable system for
encoding GB 2312-80 text which was developed by Fung Fung Lee
(lee@umunhum.stanford.edu). HZ encoding is commonly used when
exchanging e-mail or posting messages to Usenet News (specifically, to
alt.chinese.text).
The actual encoding ranges used for one- and two-byte
characters are almost identical to those of 7-bit ISO 2022 encoding
(see Section 3.1.1), except that the first-byte range is limited to
0x21 through 0x77. But, instead of using an escape sequence to shift
between one- and two-byte character modes, a simple string of two
printable characters is used.
One-byte Character Set Shift Sequence Hexadecimal
^^^^^^^^^^^^^^^^^^^^^^ ^^^^^^^^^^^^^^ ^^^^^^^^^^^
ASCII ~} 0x7E7D
Two-byte Character Set Shift Sequence Hexadecimal
^^^^^^^^^^^^^^^^^^^^^^ ^^^^^^^^^^^^^^ ^^^^^^^^^^^
GB 2312-80 ~{ 0x7E7B
The tilde character (0x7E) is interpreted as an escape character in HZ
encoding, so it has special meaning. If a tilde character is to appear
in one-byte-per-character mode, it must be doubled (so ~~ would appear
as just ~). This means that there are three escape sequences used in
HZ encoding:
Escape Sequence   Meaning
^^^^^^^^^^^^^^^   ^^^^^^^
~~                ~ in one-byte-per-character mode
~}                Shift into one-byte-per-character mode
~{                Shift into two-byte-per-character mode
There is also a fourth escape sequence, namely ~ plus a newline
character (~\n). This escape sequence is a line-continuation marker to
be consumed with no output produced.
This method works without problems because the shift sequences
represent empty positions in the very last row of the GB 2312-80 table
(actually, the second- and fourth-from-last code points, 0x7E7D and
0x7E7B). HZ encoding makes 87 of the 94 rows accessible (first bytes
0x21 through 0x77), and because there are no defined characters beyond
row 87, this causes no problems.
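The four escape sequences are easy to handle in code. The following
Python sketch is a simplified decoder, not the reference
implementation from the HZ package: the function name decode_hz is
hypothetical, two-byte codes are decoded by setting their high bits
and using Python's "gb2312" codec, and error handling is minimal:

```python
def decode_hz(data):
    """Decode HZ-encoded bytes to a string (simplified sketch)."""
    out = []
    two_byte = False
    i = 0
    while i < len(data):
        if two_byte:
            if data.startswith(b"~}", i):        # back to one-byte mode
                two_byte = False
                i += 2
                continue
            pair = data[i:i + 2]
            if len(pair) < 2:
                raise ValueError("truncated two-byte character")
            out.append(bytes(b | 0x80 for b in pair).decode("gb2312"))
            i += 2
        elif data[i] == 0x7E:                    # tilde starts an escape
            nxt = data[i + 1:i + 2]
            if nxt == b"~":
                out.append("~")                  # ~~ is a literal tilde
            elif nxt == b"{":
                two_byte = True                  # ~{ enters two-byte mode
            elif nxt == b"\n":
                pass                             # line continuation: no output
            else:
                raise ValueError("invalid escape sequence after ~")
            i += 2
        else:
            out.append(chr(data[i]))
            i += 1
    return "".join(out)


print(decode_hz(b"GB: ~{::WV~} (~~ is a tilde)"))
```

The bytes "::" and "WV" are 0x3A3A and 0x5756, the GB 2312-80 codes
from the table at the beginning of Part 3.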
The complete HZ specification is part of the HZ package,
described in RFC 1843, and available in HTML format. These are
available at the following URLs:
ftp://ftp.ifcss.org/pub/software/unix/convert/HZ-2.0.tar.gz
ftp://ftp.ora.com/pub/examples/nutshell/ujip/Ch9/rfc-1843.txt
http://umunhum.stanford.edu/~lee/chicomp/HZ_spec.html
In addition, RFC 1842 establishes "HZ-GB-2312" as the "charset"
parameter in MIME-encoded e-mail headers. Its properties are identical
to HZ encoding as described in RFC 1843.
3.3.3: zW
zW encoding, developed by Ya-Gui Wei and Edmund Lai, is older
than and somewhat similar to HZ encoding (HZ is considered to be a
better encoding system, and users are encouraged to switch over to HZ
encoding).
zW encoding is named by how it encodes each line of GB 2312-80
text, namely lines that contain Chinese text must begin with the two
characters "z" and "W" ("zW"). This encoding method does not permit
the mixture of one- (ASCII) and two-byte (GB 2312-80) characters on a
per-character basis, but rather on a per-line basis. That is, each
line can contain only Chinese or ASCII text, but not both.
More information on zW encoding can be found as part of the
ZWDOS package available at the following URL:
ftp://ftp.ifcss.org/pub/software/dos/ZWDOS/
3.3.4: BIG FIVE
Big Five is the encoding system used on machines that support
MS-DOS or Windows, and also for Macintosh (such as the Chinese
Language Kit or the fully-localized operating system).
Two-byte Standard Characters Encoding Ranges
^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ^^^^^^^^^^^^^^^
first byte range 0xA1-0xFE
second byte ranges 0x40-0x7E, 0xA1-0xFE
One-byte Characters Encoding Range
^^^^^^^^^^^^^^^^^^^ ^^^^^^^^^^^^^^
ASCII 0x21-0x7E
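The ranges in the table above translate directly into a byte-level
check. This is only a sketch: the function names are made up for
illustration, and it additionally lets any byte below 0x80 through as a
one-byte character (so that spaces and newlines pass), which is an
assumption beyond the 0x21-0x7E range listed in the table.

```python
def is_big5_lead(b: int) -> bool:
    """First byte of a two-byte Big Five character."""
    return 0xA1 <= b <= 0xFE

def is_big5_trail(b: int) -> bool:
    """Second byte: 63 + 94 = 157 possible values per row."""
    return 0x40 <= b <= 0x7E or 0xA1 <= b <= 0xFE

def looks_like_big5(data: bytes) -> bool:
    """True if every byte fits the Big Five one-/two-byte ranges."""
    i = 0
    while i < len(data):
        b = data[i]
        if is_big5_lead(b):
            if i + 1 >= len(data) or not is_big5_trail(data[i + 1]):
                return False
            i += 2
        elif b < 0x80:        # assumption: all 7-bit bytes are one-byte
            i += 1
        else:
            return False
    return True
```

A check like this only says that the bytes fall in the right ranges; it
cannot tell whether a code point is actually assigned.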
The encoding used on Macintosh is quite similar to the above,
but has a slightly shortened two-byte range (second byte range up to
0xFC only) plus additional one-byte code points, namely 0x80
(backslash), 0xFD ("copyright" symbol: "c" in a circle), 0xFE
("trademark" symbol: "TM" as a superscript), and 0xFF ("ellipsis"
symbol: three dots).
3.3.5: JOHAB
Korean hangul characters are typically encoded in what is
known as pre-combined form, namely 2 or 3 hangul elements bound into a
single character. KS C 5601-1992 enumerates 2,350 such pre-combined
forms. While this number is felt to be sufficient for most purposes,
it does not account for the total number of possible permutations. The
encoding system that encodes all possible pre-combined hangul is known
as Johab encoding (also known as "two-byte combination code" -- the
Korean word "johab" means "combine"), and is described in Annex 3 of
the KS C 5601-1992 standard. This encoding is almost like encoding all
possible three-letter words in English -- while all combinations are
possible, only a fraction represent *real* words.
Pre-combined hangul can be composed of 19 different initial,
21 different medial, and 27 different final hangul elements (28,
actually, if you count the placeholder). This provides a maximum of
11,172 pre-combined hangul. Of these 67 hangul elements, 51 are unique
(some can occur in different positions). Each of these positions is
encoded using five bits (five bits can encode up to 32 unique
objects). The encoding array looks as follows:
o Bit 1: always on
o Bits 2-6: initial hangul element
o Bits 7-11: medial hangul element
o Bits 12-16: final hangul element
Initial and final elements are consonants, and the medial elements are
vowels. This encoding must be treated as a 16-bit entity because the
bit array of the medial hangul element spans the first and second byte.
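The bit layout can be sketched with a few shifts. The function names
below are hypothetical, and the sketch assumes that "Bit 1" is the most
significant bit of the 16-bit value (which is consistent with every
hangul first byte being 0x84 or higher). The sample values are the
five-bit patterns for initial "g", medial "a", and the final "fill"
placeholder listed in the N-byte Hangul table in Section 3.3.6.

```python
def pack_johab(initial: int, medial: int, final: int) -> int:
    """Combine three 5-bit element codes into one 16-bit Johab value."""
    for field in (initial, medial, final):
        if not 0 <= field < 32:
            raise ValueError("each element code must fit in five bits")
    # Bit 1 (always on) | bits 2-6 | bits 7-11 | bits 12-16
    return 0x8000 | (initial << 10) | (medial << 5) | final

def unpack_johab(code: int) -> tuple:
    """Split a 16-bit Johab value back into (initial, medial, final)."""
    return (code >> 10) & 0x1F, (code >> 5) & 0x1F, code & 0x1F

code = pack_johab(0b00010, 0b00011, 0b00001)   # g + a + fill
first_byte, second_byte = code >> 8, code & 0xFF
```

Note how the medial element's five bits end up split across the two
bytes (two bits in the first byte, three in the second), which is why
the value cannot be processed one byte at a time. The count of 11,172
is simply 19 x 21 x 28.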
Johab encoding also provides the complete set of KS C 5601-
1992 symbols and hanja, but in different code points. Annex 3 of the
KS C 5601-1992 manual (pp 33-34) contains a complete symbol and hanja
mapping table between EUC and Johab code points. (The KS C 5601-1989
manual did not have this.) The code space ranges for Johab encoding
are as follows:
One-byte Characters Encoding Range
^^^^^^^^^^^^^^^^^^^ ^^^^^^^^^^^^^^
ASCII or KS C 5636-1993 0x21-0x7E
Two-byte Pre-combined Hangul Encoding Ranges
^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ^^^^^^^^^^^^^^^
first byte range 0x84-0xD3
second byte ranges 0x41-0x7E, 0x81-0xFE
Two-byte Symbols and Hanja Encoding Ranges
^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ^^^^^^^^^^^^^^^
first byte ranges 0xD8-0xDE, 0xE0-0xF9
second byte ranges 0x31-0x7E, 0x91-0xFE
Note that the second byte ranges encode a total of 188 characters, and
that the second byte ranges for hangul and symbols/hanja are slightly
different (yet the same size, namely 188 characters).
Here is a summary of the above table, which better describes
what is encoded where. Rows 0x84 through 0xD3 provide 80 rows of 188
characters each (15,040 code points, which is more than enough for the
11,172 pre-combined hangul). Row 0xD8 provides 188 user-defined
positions, the same as Rows 41 and 94 in the standard KS C 5601-1992
table. Rows 0xD9 through 0xDE encode Rows 1 through 12 of the standard
KS C 5601-1992 table (symbols). Rows 0xE0 through 0xF9 encode Rows 42
through 94 of the KS C 5601-1992 table (hanja). The following URL
provides a complete mapping table for the KS C 5601-1992 symbols and
hanja:
ftp://ftp.ora.com/pub/examples/nutshell/ujip/map/non-hangul-codes.txt
The following URLs provide similar information (they are the same
file), but only for the 11,172 pre-combined hangul:
ftp://ftp.ora.com/pub/examples/nutshell/ujip/map/hangul-codes.txt
ftp://unicode.org/pub/MappingTables/EastAsiaMaps/hangul-codes.txt
Of further interest may be that Microsoft designates Johab
encoding as its Code Page 1361. Microsoft is planning to support Johab
encoding for Korean Windows NT.
3.3.6: N-BYTE HANGUL
In the days before full two-byte capable operating systems,
each of the 51 basic hangul elements was encoded using a single
(7-bit) byte. The encoding range spans 0x40 through 0x7C, but there
are several unassigned gaps. This is known as the "N-byte Hangul"
code, and is described in Annex 4 (page 35) of the KS C 5601-1992
manual.
The following table illustrates these 51 one-byte code points
(the pronunciation or meaning of the hangul element is provided in
parentheses) and how they map to the three 5-bit arrays in Johab
encoding (expressed as binary patterns):
Element Initial Medial Final
^^^^^^^ ^^^^^^^ ^^^^^^ ^^^^^
0x40 ("fill") 00001 00010 00001
0x41 (g) 00010 ***** 00010
0x42 (gg) 00011 ***** 00011
0x43 (gs) ***** ***** 00100
0x44 (n) 00100 ***** 00101
0x45 (nj) ***** ***** 00110
0x46 (nh) ***** ***** 00111
0x47 (d) 00101 ***** 01000
0x48 (dd) 00110 ***** *****
0x49 (r) 00111 ***** 01001
0x4A (rg) ***** ***** 01010
0x4B (rm) ***** ***** 01011
0x4C (rb) ***** ***** 01100
0x4D (rs) ***** ***** 01101
0x4E (rt) ***** ***** 01110
0x4F (rp) ***** ***** 01111
0x50 (rh) ***** ***** 10000
0x51 (m) 01000 ***** 10001
0x52 (b) 01001 ***** 10011
0x53 (bb) 01010 ***** *****
0x54 (bs) ***** ***** 10100
0x55 (s) 01011 ***** 10101
0x56 (ss) 01100 ***** 10110
0x57 (ng) 01101 ***** 10111
0x58 (j) 01110 ***** 11000
0x59 (jj) 01111 ***** *****
0x5A (c) 10000 ***** 11001
0x5B (k) 10001 ***** 11010
0x5C (t) 10010 ***** 11011
0x5D (p) 10011 ***** 11100
0x5E (h) 10100 ***** 11101
0x5F UNASSIGNED
0x60 UNASSIGNED
0x61 UNASSIGNED
0x62 (a) ***** 00011 *****
0x63 (ae) ***** 00100 *****
0x64 (ya) ***** 00101 *****
0x65 (yae) ***** 00110 *****
0x66 (eo) ***** 00111 *****
0x67 (e) ***** 01010 *****
0x68 UNASSIGNED
0x69 UNASSIGNED
0x6A (yeo) ***** 01011 *****
0x6B (ye) ***** 01100 *****
0x6C (o) ***** 01101 *****
0x6D (wa) ***** 01110 *****
0x6E (wae) ***** 01111 *****
0x6F (oe) ***** 10010 *****
0x70 UNASSIGNED
0x71 UNASSIGNED
0x72 (yo) ***** 10011 *****
0x73 (u) ***** 10100 *****
0x74 (weo) ***** 10101 *****
0x75 (we) ***** 10110 *****
0x76 (wi) ***** 10111 *****
0x77 (yu) ***** 11010 *****
0x78 UNASSIGNED
0x79 UNASSIGNED
0x7A (eu) ***** 11011 *****
0x7B (yi) ***** 11100 *****
0x7C (i) ***** 11101 *****
There are utilities to convert N-byte Hangul code to other,
more widely-used, encoding methods. Pointers to these and other code
conversion utilities can be found in Section 4.7.
3.3.7: UCS-2
UCS-2 (Universal Character Set containing 2 bytes) encoding is
one way to encode ISO 10646-1:1993 text, and is considered identical
to Unicode encoding. Its encoding range, which is quite simple, is as
follows:
ISO 10646-1:1993 Characters Encoding Range
^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ^^^^^^^^^^^^^^
first byte range 0x00-0xFF
second byte range 0x00-0xFF
Yes, folks, the whole range of 65,536 possible code points is
available for encoding characters. The "signature" that indicates a
file using UCS-2 is as follows:
0xFEFF
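A reader can use the signature to work out the byte order of a UCS-2
file. The sketch below is an illustration only: the function names are
hypothetical, the byte-swapped form 0xFFFE is taken to mean
little-endian data, and falling back to big-endian when no signature is
present is an assumption of the sketch.

```python
from typing import Optional

def ucs2_byte_order(data: bytes) -> Optional[str]:
    """Return 'big' or 'little' based on the signature, else None."""
    if data[:2] == b"\xFE\xFF":
        return "big"
    if data[:2] == b"\xFF\xFE":      # signature seen byte-swapped
        return "little"
    return None

def read_ucs2(data: bytes) -> list:
    """Return the 16-bit code values, skipping the signature if present."""
    order = ucs2_byte_order(data)
    body = data[2:] if order else data
    order = order or "big"           # assumption: default to big-endian
    return [int.from_bytes(body[i:i + 2], order)
            for i in range(0, len(body) - 1, 2)]
```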
Escape sequences for UCS-2 have already been registered with
ISO, and are as follows:
ISO 10646-1:1993 Escape Sequence Hexadecimal ISO Reg
^^^^^^^^^^^^^^^^ ^^^^^^^^^^^^^^^ ^^^^^^^^^^^ ^^^^^^^
UCS-2 Level 1 <ESC> % / @ 0x1B252F40 162
UCS-2 Level 2 <ESC> % / C 0x1B252F43 174
UCS-2 Level 3 <ESC> % / E 0x1B252F45 176
So what do these three levels mean? Level 3 means all characters in
ISO 10646-1:1993 with no restrictions (0x0000 through 0xFFFF).
Level 2 begins to restrict the character set by not including
the following characters or character ranges:
0x0300-0x0345 0x09D7 0x0BD7 0x11A8-0x11F9
0x0360-0x0361 0x0A3C 0x0C55-0x0C56 0x20D0-0x20E1
0x0483-0x0486 0x0A70-0x0A71 0x0CD5-0x0CD6 0x302A-0x302F
0x093C 0x0ABC 0x0D57 0x3099-0x309A
0x0953-0x0954 0x0B3C 0x1100-0x1159 0xFE20-0xFE23
0x09BC 0x0B56-0x0B57 0x115F-0x11A2
These are all combining characters, and represent 364 code points.
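The Level 2 exclusion list is easy to turn into a lookup table, and
summing the ranges confirms the figure of 364 code points. The names in
this sketch are hypothetical; the ranges are copied from the table
above.

```python
# (first, last) pairs, inclusive, copied from the Level 2 table.
LEVEL2_EXCLUDED = [
    (0x0300, 0x0345), (0x0360, 0x0361), (0x0483, 0x0486),
    (0x093C, 0x093C), (0x0953, 0x0954), (0x09BC, 0x09BC),
    (0x09D7, 0x09D7), (0x0A3C, 0x0A3C), (0x0A70, 0x0A71),
    (0x0ABC, 0x0ABC), (0x0B3C, 0x0B3C), (0x0B56, 0x0B57),
    (0x0BD7, 0x0BD7), (0x0C55, 0x0C56), (0x0CD5, 0x0CD6),
    (0x0D57, 0x0D57), (0x1100, 0x1159), (0x115F, 0x11A2),
    (0x11A8, 0x11F9), (0x20D0, 0x20E1), (0x302A, 0x302F),
    (0x3099, 0x309A), (0xFE20, 0xFE23),
]

def excluded_from_level2(code_point: int) -> bool:
    """True if the code point is one that Level 2 leaves out."""
    return any(lo <= code_point <= hi for lo, hi in LEVEL2_EXCLUDED)

# Total number of excluded code points across all ranges.
level2_excluded_count = sum(hi - lo + 1 for lo, hi in LEVEL2_EXCLUDED)
```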
Level 1 further restricts the character set by not including
the following characters or character ranges:
0x05B0-0x05B9 0x09BE-0x09C4 0x0B47-0x0B48 0x0D02-0x0D03
0x05BB-0x05BD 0x09C7-0x09C8 0x0B4B-0x0B4D 0x0D3E-0x0D43
0x05BF 0x09CB-0x09CD 0x0B82-0x0B83 0x0D46-0x0D48
0x05C1-0x05C2 0x09E2-0x09E3 0x0BBE-0x0BC2 0x0D4A-0x0D4D
0x064B-0x0652 0x0A02 0x0BC6-0x0BC8 0x0E31
0x0670 0x0A3E-0x0A42 0x0BCA-0x0BCD 0x0E34-0x0E3A
0x06D6-0x06E4 0x0A47-0x0A48 0x0C01-0x0C03 0x0E47-0x0E4E
0x06E7-0x06E8 0x0A4B-0x0A4D 0x0C3E-0x0C44 0x0EB1
0x06EA-0x06ED 0x0A81-0x0A83 0x0C46-0x0C48 0x0EB4-0x0EB9
0x0901-0x0903 0x0ABE-0x0AC5 0x0C4A-0x0C4D 0x0EBB-0x0EBC
0x093E-0x094D 0x0AC7-0x0AC9 0x0C82-0x0C83 0x0EC8-0x0ECD
0x0951-0x0952 0x0ACB-0x0ACD 0x0CBE-0x0CC4 0xFB1E
0x0962-0x0963 0x0B01-0x0B03 0x0CC6-0x0CC8
0x0981-0x0983 0x0B3E-0x0B43 0x0CCA-0x0CCD
These, too, are all combining characters, and represent 586 code
points (222 above plus the 364 characters from the Level 2
restriction).
3.3.8: UCS-4
UCS-4 (Universal Character Set containing 4 bytes) encoding is
another way to encode ISO 10646-1:1993 text, and is used for future
expansion of the character set. Its encoding range is as follows:
ISO 10646-1:1993 Characters Encoding Range
^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ^^^^^^^^^^^^^^
first byte range 0x00-0x7F
second byte range 0x00-0xFF
third byte range 0x00-0xFF
fourth byte range 0x00-0xFF
Note that the first byte range only goes up to 0x7F. This means that
UCS-4 is a 31-bit encoding. And, in case you're wondering, 31 bits
provide 2,147,483,648 code points. The "signature" that indicates a
file using UCS-4 is as follows:
0x0000 0xFEFF
Escape sequences for UCS-4 have already been registered with
ISO, and are as follows:
ISO 10646-1:1993 Escape Sequence Hexadecimal ISO Reg
^^^^^^^^^^^^^^^^ ^^^^^^^^^^^^^^^ ^^^^^^^^^^^ ^^^^^^^
UCS-4 Level 1 <ESC> % / A 0x1B252F41 163
UCS-4 Level 2 <ESC> % / D 0x1B252F44 175
UCS-4 Level 3 <ESC> % / F 0x1B252F46 177
See the end of Section 3.3.7 for a description of these three levels.
But, in the case of UCS-4, simply prepend "0000" to all the values.
3.3.9: UTF-7
It turns out that *raw* ISO 10646-1:1993 encoding (that is,
UCS-2 or UCS-4) can cause problems because null bytes (0x00) are
possible (and frequent). Several UTFs (UCS Transformation Formats)
have been developed to deal with this and other problems. I must admit
that I don't know too much about UTFs, and what I provide below is
minimal, but does include pointers to more complete descriptions.
UTF-7 is a mail-safe 7-bit transformation format for UCS-2
(including UTF-16). It uses straight ASCII for many ASCII characters,
and switches into a Base64 encoding of UCS-2 or UTF-16 for everything
else. It was designed to be usable in MIME-compliant e-mail headers as
well as message bodies, and to pass through gateways to non-ASCII mail
systems (like Bitnet). More detailed information on UTF-7 can be found
in RFC 1642, and a UTF-7 converter is available. The following URLs
provide this information:
http://www.stonehand.com/unicode/standard/utf7.html
ftp://unicode.org/pub/Programs/ConvertUTF/
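For a quick look at what UTF-7 does to text, Python's built-in "utf-7"
codec can be used. This is just a demonstration with a sample string of
my choosing, not a statement about how any particular mailer behaves.

```python
def to_utf7(text: str) -> bytes:
    """Encode text as UTF-7 using Python's built-in codec."""
    return text.encode("utf-7")

sample = "Hi 日本語"
encoded = to_utf7(sample)

# Every byte of the result is 7-bit, so it survives 7-bit mail paths.
is_seven_bit = all(b < 0x80 for b in encoded)

# Decoding gives back the original text.
round_trip = encoded.decode("utf-7")
```

The ASCII portion ("Hi ") passes through unchanged, and the kanji are
carried in a Base64 run introduced by a plus sign.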
3.3.10: UTF-8
UTF-8 (also known as UTF-2 or FSS-UTF -- FSS stands for "file
system safe") can represent any character in UCS-2 and UCS-4, and is
officially an annex to ISO 10646-1:1993. It is different from UTF-7 in
that it encodes character sets into 8-bit bytes. UCS-2 and UCS-4 have
problems with some file systems and utilities, so this UTF was
developed.
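The null-byte problem is easy to see with a two-character sample. The
sketch below uses Python's "utf-16-be" codec as a stand-in for raw
big-endian UCS-2 (an assumption that holds for these two characters),
and the helper name is hypothetical.

```python
def has_null_bytes(data: bytes) -> bool:
    """True if the byte string contains at least one 0x00 byte."""
    return 0 in data

sample = "A日"
as_ucs2 = sample.encode("utf-16-be")   # stand-in for raw UCS-2
as_utf8 = sample.encode("utf-8")
```

Here as_ucs2 begins with the bytes 0x00 0x41 for "A", while as_utf8
keeps "A" as the single byte 0x41 and spends three bytes, all 0x80 or
above, on the kanji.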
More detailed information on UTF-8 and its relationship with
ISO 10646-1:1993 can be found at the following URLs:
http://www.stonehand.com/unicode/standard/utf8.html
ftp://unicode.org/pub/Programs/ConvertUTF/
X/Open Company Limited also published a document that
describes UTF-8 in detail (they call it FSS-UTF), and you can find
information about it at the following URL:
http://www.xopen.co.uk/public/pubs/catalog/c501.htm
The new programming language called Java supports Unicode through
UTF-8. More information on Java is at the following URL:
http://www.javasoft.com/
3.3.11: UTF-16
UTF-16 (formerly UCS-2E), like UTF-8, is now officially an
annex to ISO 10646-1:1993. From what I've read, UTF-16 transforms
UCS-4 into a 16-bit form. UTF-16 can then be further encoded in UTF-7
or UTF-8 (but doing this is not according to the standard -- there is
little to gain by doing so).
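The text above does not go into how the 16-bit form is produced. As a
sketch, here is the surrogate-pair arithmetic as I understand it from
the published UTF-16 definition; the function name is hypothetical, and
the sketch does not reject code points that fall in the surrogate range
itself.

```python
def to_utf16_units(code_point: int) -> list:
    """Return the one or two 16-bit units that represent a code point."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("outside the range UTF-16 can represent")
    if code_point < 0x10000:
        return [code_point]              # fits in a single 16-bit unit
    offset = code_point - 0x10000        # 20 bits remain
    high = 0xD800 + (offset >> 10)       # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits
    return [high, low]
```

Anything below 0x10000 is passed through as-is, so UCS-2 text is
unaffected.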
More detailed information on UTF-16 and its relationship with
ISO 10646-1:1993 can be found at the following URLs:
http://www.stonehand.com/unicode/standard/utf16.html
ftp://unicode.org/pub/Programs/ConvertUTF/
3.3.12: ANSI Z39.64-1989
The encoding used for ANSI Z39.64-1989 (and CCCII) is three-
byte 7-bit ISO 2022, namely the following code space:
Three-byte ANSI Z39.64-1989 Encoding Range
^^^^^^^^^^^^^^^^^^^^^^^^^^^ ^^^^^^^^^^^^^^
first byte range 0x21-0x7E
second byte range 0x21-0x7E
third byte range 0x21-0x7E
3.3.13: BASE64
Base64 encoding is mentioned here only because of its common
usage in e-mail headers, and relationship with MIME (Multipurpose
Internet Mail Extensions). It is also a source of confusion. Base64 is
a method of encoding arbitrary bytes into the safest 64-character
ASCII subset, and is defined in RFC 1341 (which adapted it from RFC
1113). RFC 1341 was made obsolete by RFC 1521. RFC 1522 also provides
useful information, particularly for handling non-ASCII text, and
obsoletes RFC 1342.
Here is how it works. Every three bytes are encoded as a
four-byte sequence. That is, the 24 bits that make up the three bytes
are split into four 6-bit segments (6 bits can encode up to 64
characters). Each 6-bit segment is then converted into a character in
the Base64 Alphabet (see below). There is a 65th character, "=", which
has a special purpose (it functions as a "pad" if a full three-byte
sequence is not found). This all may sound a bit like uuencoding, but
it is different. The Base64 Alphabet is as follows:
ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/
My name, written in Japanese kanji, is as follows when it is
EUC-encoded (six bytes, expressed as three groups of hexadecimal
values, one group for each character):
0xBEAE 0xCED3 0xB7F5
When these three EUC-encoded characters are converted to Base64
encoding, they appear as follows (eight bytes):
vq7O07f1
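The three-bytes-to-four-characters step can be written out by hand.
This is a teaching sketch (the function name is hypothetical); Python's
standard base64 module does the same job and should be preferred in
practice.

```python
ALPHABET = ("ABCDEFGHIJKLMNOPQRSTUVWXYZ"
            "abcdefghijklmnopqrstuvwxyz"
            "0123456789+/")

def base64_encode(data: bytes) -> str:
    """Encode bytes as Base64, three input bytes at a time."""
    out = []
    for i in range(0, len(data), 3):
        chunk = data[i:i + 3]
        pad = 3 - len(chunk)                       # 0, 1, or 2
        n = int.from_bytes(chunk + b"\x00" * pad, "big")
        # Split the 24 bits into four 6-bit segments.
        quad = [ALPHABET[(n >> shift) & 0x3F] for shift in (18, 12, 6, 0)]
        if pad:
            quad[4 - pad:] = "=" * pad             # "=" marks padding
        out.extend(quad)
    return "".join(out)

kanji_euc = bytes.fromhex("BEAECED3B7F5")          # the six bytes above
encoded = base64_encode(kanji_euc)
```

Run on the six EUC bytes shown above, this yields the same eight
characters, vq7O07f1.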
Base64 encoding is most commonly used for encoding non-ASCII
text that appears in e-mail headers. Of all the portions of an e-mail
message, its header gets manipulated the most during transmission, and
Base64 encoding offers a safe way to further encode non-ASCII text so
that it is not altered by mail-routing software. This is where Base64
encoding can cause confusion. For example, what goes through your mind
when you see the following chunk o' text?
From: lunde@adobe.com (=?ISO-2022-JP?B?vq7O07f1?=)
Many folks think that they are seeing ISO-2022-JP encoding. Not
true. The "ISO-2022-JP" portion is just a flag that indicates the
original encoding before Base64 encoding was applied. The actual
Base64-encoded portion is enclosed between question marks (?) as
follows:
From: lunde@adobe.com (=?ISO-2022-JP?B?vq7O07f1?=)
>^^^^^^^^<
The whole string enclosed in parentheses has several components, and
the following explains their purpose and relationships (using the
above string as an example):
  Component     Explanation
  ^^^^^^^^^     ^^^^^^^^^^^
  =?            Signals start of encoded string
  ISO-2022-JP   Charset name ("ISO-2022-JP" is for Japanese)
  ?             Delimiter
  B             Encoding ("B" is for Base64)
  ?             Delimiter
  vq7O07f1      Example string of type "charset" encoded by "encoding"
  ?=            Signals end of encoded string
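The components in the table can be pulled apart with a regular
expression. This is a minimal sketch that handles only the "B"
encoding; the pattern and function name are my own, not taken from any
RFC or mail package.

```python
import base64
import re

# =? charset ? encoding ? encoded-text ?=
ENCODED_WORD = re.compile(r"=\?([^?]+)\?([BbQq])\?([^?]*)\?=")

def parse_encoded_word(word: str):
    """Return (charset, encoding, raw bytes) for a 'B' encoded word."""
    match = ENCODED_WORD.fullmatch(word)
    if match is None:
        raise ValueError("not an encoded word")
    charset, encoding, payload = match.groups()
    if encoding.upper() != "B":
        raise NotImplementedError("this sketch handles Base64 only")
    # The bytes are returned undecoded; the charset name says how the
    # sender intends them to be interpreted.
    return charset, encoding.upper(), base64.b64decode(payload)

parts = parse_encoded_word("=?ISO-2022-JP?B?vq7O07f1?=")
```

For the example string, the third element of the result is the six
bytes 0xBEAE 0xCED3 0xB7F5 shown earlier.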
One typically does not need to worry about encoding text as
Base64 (MIME-compliant mailing software usually performs this task for
you). The problem is usually trying to decode Base64-encoded text. A
Base64 decoder is available in Perl at the following URL:
ftp://ftp.ora.com/pub/examples/nutshell/ujip/perl/b64decode.pl
Note that this program takes "raw" Base64 data as input. Any non-
Base64 stuff must be stripped. I usually run this from within Mule
("C-u M-| b64decode.pl") after defining a region around the Base64-
encoded material. I hope to replace this program soon with one that
automatically recognizes the Base64-encoded portions.
Most MIME-compliant e-mail software can decode Base64-encoded
text.
3.3.14: IBM DBCS-HOST
The oldest two-byte encoding system is IBM's DBCS-Host. DBCS
stands for Double-Byte Character Set. DBCS-Host is still in use on
IBM's mainframe computer systems (hence the use of "Host").
DBCS-Host encoding is EBCDIC-based, and uses Shift characters,
0x0E and 0x0F, to switch between one- and two-byte mode. Its encoding
specifications are as follows:
Two-byte Characters Encoding Range
^^^^^^^^^^^^^^^^^^^ ^^^^^^^^^^^^^^
first byte range 0x41-0xFE
second byte range 0x41-0xFE
Two-byte "Space" Character Code Point
^^^^^^^^^^^^^^^^^^^^^^^^^^ ^^^^^^^^^^
first- and second byte 0x4040
One-byte Characters Encoding Range
^^^^^^^^^^^^^^^^^^^ ^^^^^^^^^^^^^^
EBCDIC 0x41-0xF9
Shifting Characters Code Point
^^^^^^^^^^^^^^^^^^^ ^^^^^^^^^^^^^^
Two-byte 0x0E
One-byte 0x0F
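Because the mode is carried by the two shifting characters, a reader
has to track state while walking the bytes. The sketch below (with a
hypothetical function name) splits a DBCS-Host stream into one- and
two-byte units and drops the shifting characters; it assumes the stream
starts in one-byte mode and does no range checking.

```python
SHIFT_TO_TWO_BYTE = 0x0E
SHIFT_TO_ONE_BYTE = 0x0F

def split_dbcs_host(data: bytes) -> list:
    """Split a shift-delimited stream into one- and two-byte units."""
    units = []
    two_byte = False                 # assumption: start in one-byte mode
    i = 0
    while i < len(data):
        b = data[i]
        if b == SHIFT_TO_TWO_BYTE:
            two_byte = True
            i += 1
        elif b == SHIFT_TO_ONE_BYTE:
            two_byte = False
            i += 1
        elif two_byte:
            if i + 1 >= len(data):
                raise ValueError("truncated two-byte character")
            units.append(data[i:i + 2])
            i += 2
        else:
            units.append(data[i:i + 1])
            i += 1
    return units
```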
This same encoding specification is shared by all of IBM's CJK
character sets, namely for Japanese, Simplified Chinese, Traditional
Chinese, and Korean.
3.3.15: IBM DBCS-PC
IBM's DBCS-PC encoding is used on IBM personal computers (that
is where the "PC" comes from). DBCS-PC encoding is ASCII-based, and
uses the values of characters' bytes themselves to switch between one-
and two-byte mode. Its encoding specifications are as follows:
Two-byte Characters Encoding Ranges
^^^^^^^^^^^^^^^^^^^ ^^^^^^^^^^^^^^^
first byte range 0x81-0xFE
second byte range 0x40-0x7E, 0x80-0xFE
One-byte Characters Encoding Range
^^^^^^^^^^^^^^^^^^^ ^^^^^^^^^^^^^^
ASCII 0x21-0x7E
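With DBCS-PC there are no shifting characters; the first byte alone
decides whether one or two bytes make up a character. A sketch of that
rule follows (the function name is hypothetical). It uses only the
generic ranges in the table above and does not account for
per-language accommodations such as a one-byte range above 0x80.

```python
def split_dbcs_pc(data: bytes) -> list:
    """Split a DBCS-PC byte string into one- and two-byte units."""
    units = []
    i = 0
    while i < len(data):
        b = data[i]
        if 0x81 <= b <= 0xFE:                  # first byte of a pair
            if i + 1 >= len(data):
                raise ValueError("truncated two-byte character")
            trail = data[i + 1]
            if not (0x40 <= trail <= 0x7E or 0x80 <= trail <= 0xFE):
                raise ValueError("invalid second byte")
            units.append(data[i:i + 2])
            i += 2
        else:                                  # one-byte character
            units.append(data[i:i + 1])
            i += 1
    return units
```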
This same encoding specification is shared by all of IBM's CJK
character sets, namely for Japanese, Simplified Chinese, Traditional
Chinese, and Korean.
DBCS-PC encoding for Japanese, although conforming to the
above encoding specifications, actually uses the same encoding
specifications as Shift-JIS, including the full user-defined range
(see Section 3.3.1 for more details on Shift-JIS encoding). One big
accommodation is the half-width katakana range, namely 0xA1 through
0xDF. Further, the DBCS-PC code space that is outside the Shift-JIS
specification is unused.
DBCS-PC encoding for Korean uses the equivalent of EUC code
set 1 code points (0xA1A1 through 0xFEFE) for those characters that
are common with KS C 5601-1992. Those characters that are not common
with KS C 5601-1992, namely IBM's extensions, are within the DBCS-PC
encoding space, but outside EUC encoding space (0x9A through 0xA0).
Many hanja and pre-combined hangul are part of IBM's Korean extension.
Note that DBCS-PC is sort of useless without a corresponding
SBCS (Single-Byte Character Set) for the one-byte range. Mixing DBCS
and SBCS results in a MBCS (Multiple-Byte Character Set). How these
are mixed to form MBCSs is detailed in Section 3.4.
3.3.16: IBM DBCS-/TBCS-EUC
IBM has also developed DBCS-EUC and TBCS-EUC encodings. TBCS
stands for Triple-Byte Character Set. These essentially follow the EUC
encoding specifications, and were developed for use with IBM's AIX
(Advanced Interactive Executive) operating system, which is
UNIX-based.
Refer to Section 3.2 for all the details on EUC encoding.
3.3.17: UNIFIED HANGUL CODE
Microsoft has developed what is called "Unified Hangul Code"
(UHC) for its Windows 95 operating system (this was also known as
"Extended Wansung"). It is the optional, not standard, character set
of Win95K.
UHC provides full compatibility with KS C 5601-1992 EUC
encoding (see Section 3.2.4), but adds additional encoding ranges for
holding additional pre-combined hangul (more precisely, the 8,822 that
are needed to fully support the Johab character set). The following is
a table that provides the encoding ranges for UHC encoding:
Two-byte Standard Characters Encoding Ranges
^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ^^^^^^^^^^^^^^^
first byte range 0x81-0xFE
second byte ranges 0x41-0x5A, 0x61-0x7A,
and 0x81-0xFE
One-byte Characters Encoding Range
^^^^^^^^^^^^^^^^^^^ ^^^^^^^^^^^^^^
ASCII 0x21-0x7E
Note that 0xA1A1 through 0xFEFE in the above encoding is still
identical, in terms of character-to-code allocation, with KS C 5601-
1992 in EUC encoding.
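As a sketch, the ranges above can be used to tell whether a two-byte
value sits in the KS C 5601-1992 EUC block or in the area UHC adds
around it. The function name and the two labels are my own, and a
result of "extension" only means the bytes fall in the added ranges,
not that a character is actually assigned there.

```python
def uhc_region(first: int, second: int):
    """Classify a byte pair as 'KS C 5601-1992', 'extension', or None."""
    if not 0x81 <= first <= 0xFE:
        return None
    # The EUC block is untouched by UHC.
    if first >= 0xA1 and 0xA1 <= second <= 0xFE:
        return "KS C 5601-1992"
    if (0x41 <= second <= 0x5A or 0x61 <= second <= 0x7A
            or 0x81 <= second <= 0xFE):
        return "extension"
    return None

# Pre-combined hangul that UHC has to add beyond KS C 5601-1992.
added_hangul = 11172 - 2350
```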
Appendix G (pp 345-406) of "Developing International Software
for Windows 95 and Windows NT" by Nadine Kano illustrates the KS C
5601-1992 character set standard plus these Microsoft extensions
(8,822 pre-combined hangul) by UHC code (Microsoft calls this Code
Page 949).
3.3.18: TRON CODE
TRON (The Real-time Operating system Nucleus) is an OS
developed in Japan some time ago. Personal Media Corporation has done
work to develop BTRON (Business TRON), which is unique in that it is
the only commercially-available OS that supports JIS X 0212-1990.
TRON Code provides a one- and two-byte encoding space and a
method for switching between them.
The following is how the two-byte space in TRON Code is
allocated:
A-Zone (8,836 characters; JIS X 0208-1990) Encoding Range
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ^^^^^^^^^^^^^^
first byte range 0x21-0x7E
second byte range 0x21-0x7E
B-Zone (11,844 characters; JIS X 0212-1990) Encoding Range
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ^^^^^^^^^^^^^^
first byte range 0x80-0xFD
second byte range 0x21-0x7E
C-Zone (11,844 characters; unassigned) Encoding Range
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ^^^^^^^^^^^^^^
first byte range 0x21-0x7E
second byte range 0x80-0xFD
D-Zone (15,876 characters; unassigned) Encoding Range
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ^^^^^^^^^^^^^^
first byte range 0x80-0xFD
second byte range 0x80-0xFD
Note how the B-Zone is larger than the conventional 94-by-94
matrix. In fact, the JIS X 0212-1990 portion of the B-Zone is
restricted to 0xA121-0xFD7E (93-by-94 matrix -- 0xFE as a first-byte
value is unavailable, and you will see why in a minute).
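The four zones are just the four combinations of a "low" and a "high"
byte range, which also explains the character counts in the table. A
small sketch (names are hypothetical):

```python
LOW = range(0x21, 0x7F)     # 0x21-0x7E: 94 values
HIGH = range(0x80, 0xFE)    # 0x80-0xFD: 126 values

def tron_zone(first: int, second: int):
    """Return 'A', 'B', 'C', or 'D' for a two-byte value, else None."""
    if first in LOW and second in LOW:
        return "A"
    if first in HIGH and second in LOW:
        return "B"
    if first in LOW and second in HIGH:
        return "C"
    if first in HIGH and second in HIGH:
        return "D"
    return None

ZONE_SIZES = {
    "A": len(LOW) * len(LOW),
    "B": len(HIGH) * len(LOW),
    "C": len(LOW) * len(HIGH),
    "D": len(HIGH) * len(HIGH),
}
```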
TRON Code implements "language specifying codes" consisting of
two bytes as follows:
Two-byte Japanese 0xFE21
One-byte English 0xFE80
0xFE21 in a one-byte stream invokes two-byte Japanese mode, and 0xFE80
in a two-byte stream invokes one-byte English mode.
The following is the one-byte encoding range for TRON Code:
One-byte Characters 0x21-0x7E and 0x80-0xFD
Control codes are in 0x00-0x20 and 0x7F (the usual ASCII control code
range). Also, 0xA0 is reserved as a fixed-width space character.
3.3.19: GBK
GBK is an extension to GB 2312-80 that adds all ISO 10646-
1:1993 (GB 13000.1-93) hanzi not already in GB 2312-80. GBK is defined
as a normative annex of GB 13000.1-93 (see Section 2.2.10). The "K" in
"GBK" is the first sound in the Chinese word meaning "extension" (read
"Kuo Zhan").
GBK is divided into five levels as follows:
  Level   Encoded Range   Total Code Points   Total Encoded Characters
  ^^^^^   ^^^^^^^^^^^^^   ^^^^^^^^^^^^^^^^^   ^^^^^^^^^^^^^^^^^^^^^^^^
  GBK/1   0xA1A1-0xA9FE   846                 717
  GBK/2   0xB0A1-0xF7FE   6,768               6,763
  GBK/3   0x8140-0xA0FE   6,080               6,080
  GBK/4   0xAA40-0xFEA0   8,160               8,160
  GBK/5   0xA840-0xA9A0   192                 166
There are also 1,894 user-defined code points as follows:
  Encoded Range   Total Code Points
  ^^^^^^^^^^^^^   ^^^^^^^^^^^^^^^^^
  0xAAA1-0xAFFE   564
  0xF8A1-0xFEFE   658
  0xA140-0xA7A0   672
GBK thus provides a total of 23,940 code points, 21,886 of
which are assigned.
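The totals above can be checked mechanically. The sketch below treats
each encoded range as a rectangle of first-byte values times
second-byte values, skipping 0x7F as a second byte (it is not part of
the GBK second-byte ranges). The names are hypothetical; the figures
come from the two tables above.

```python
def code_points(lo: int, hi: int) -> int:
    """Count the code points in the rectangle from lo to hi."""
    rows = (hi >> 8) - (lo >> 8) + 1
    cells = sum(1 for b in range(lo & 0xFF, (hi & 0xFF) + 1) if b != 0x7F)
    return rows * cells

# level: (first code, last code, encoded characters)
LEVELS = {
    "GBK/1": (0xA1A1, 0xA9FE, 717),
    "GBK/2": (0xB0A1, 0xF7FE, 6763),
    "GBK/3": (0x8140, 0xA0FE, 6080),
    "GBK/4": (0xAA40, 0xFEA0, 8160),
    "GBK/5": (0xA840, 0xA9A0, 166),
}
USER_DEFINED = [(0xAAA1, 0xAFFE), (0xF8A1, 0xFEFE), (0xA140, 0xA7A0)]

level_points = sum(code_points(lo, hi) for lo, hi, _ in LEVELS.values())
user_points = sum(code_points(lo, hi) for lo, hi in USER_DEFINED)
total_points = level_points + user_points
assigned = sum(chars for _, _, chars in LEVELS.values())
```

The grand total also equals 126 first-byte values times 190 second-byte
values, that is, the full GBK code space.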
Each "row" in the GBK code table consists of 190 characters.
The following describes the encoding ranges of GBK in detail:
Two-byte Standard Characters Encoding Ranges
^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ^^^^^^^^^^^^^^^
first byte range 0x81-0xFE
second byte ranges 0x40-0x7E and 0x80-0xFE
One-byte Characters Encoding Range
^^^^^^^^^^^^^^^^^^^ ^^^^^^^^^^^^^^
ASCII 0x21-0x7E
Note that the sub-range 0xA1A1-0xFEFE in the above encoding is still
identical, in terms of character-to-code allocation, with GB 2312-80
in EUC encoding. GBK is therefore backward-compatible with GB 2312-80
and forward-compatible with ISO 10646-1:1993.
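Backward compatibility can be observed with Python's built-in "gb2312"
and "gbk" codecs. This is only a demonstration under the assumption
that those codecs reflect the character sets as described here; the
helper name and the sample characters are my own choices.

```python
def same_in_gb2312_and_gbk(text: str) -> bool:
    """True if the text encodes to identical bytes under both codecs."""
    return text.encode("gb2312") == text.encode("gbk")

def in_gbk_second_byte_range(b: int) -> bool:
    """Second byte ranges from the table: 0x40-0x7E and 0x80-0xFE."""
    return 0x40 <= b <= 0x7E or 0x80 <= b <= 0xFE

# A simplified-Chinese sample that GB 2312-80 already covers.
shared = same_in_gb2312_and_gbk("中文")

# A traditional-form hanzi that GB 2312-80 lacks but GBK adds.
extra = "龍".encode("gbk")
```

The two bytes in extra fall outside the 0xA1A1-0xFEFE sub-range, which
is where GBK finds room for its additions.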
GBK is the standard character set and encoding for the
Simplified Chinese version of Windows 95.
3.4: CJK CODE PAGES
One often encounters references to "Code Pages" in
material about CJK (and other) character sets and encodings. These are
not literal pages, but rather references to a character set and
encoding combination. In the case of CJK Code Pages, they definitely
comprise more than one page!
Microsoft refers to its supported CJK character sets and
encodings through such Code Page designations. The following is a
listing of several Microsoft CJK Code Pages along with their
characteristics:
  Code Page   Characteristics
  ^^^^^^^^^   ^^^^^^^^^^^^^^^
  932         JIS X 0208-1990 base, Shift-JIS encoding, Microsoft
              extensions (NEC Row 13 and IBM select characters
              redundantly encoded in Rows 89 through 92 and Rows 115
              through 119)
  936         GB 2312-80 base, EUC encoding
  949         KS C 5601-1992 base, Unified Hangul Code encoding,
              remaining 8,822 pre-combined hangul as extension (all of
              this is referred to as Unified Hangul Code)
  950         Big Five base, Big Five encoding, Microsoft extensions
              (actually, the ETen extensions of Row 89)
  1361        Johab base, Johab encoding
IBM also uses Code Page designations, and, in fact, some
designations (and associated characteristics) are nearly identical to
those in the above table, most notably, Code Pages 932 and 936. IBM's
Code Page 932 does not include NEC Row 13 or IBM select characters in
Rows 89 through 92.
The best way to describe IBM Code Page designations is by
first listing the SBCS (Single-Byte Character Set) and DBCS (Double-
Byte Character Set) Code Page designations (those designated by "Host"
use EBCDIC-based encodings):
IBM SBCS Code Page Characteristics
^^^^^^^^^^^^^^^^^^ ^^^^^^^^^^^^^^^
37 (US) SBCS-Host
290 (Japanese) SBCS-Host
833 (Korean) SBCS-Host
836 (Simplified Chinese) SBCS-Host
891 (Korean) SBCS-PC
897 (Japanese) SBCS-PC
903 (Simplified Chinese) SBCS-PC
904 (Traditional Chinese) SBCS-PC
IBM DBCS Code Page Characteristics
^^^^^^^^^^^^^^^^^^ ^^^^^^^^^^^^^^^
300 (Japanese) DBCS-Host
301 (Japanese) DBCS-PC
834 (Korean) DBCS-Host
835 (Traditional Chinese) DBCS-Host
837 (Simplified Chinese) DBCS-Host
926 (Korean) DBCS-PC
927 (Traditional Chinese) DBCS-PC
928 (Simplified Chinese) DBCS-PC
So far there appears to be no relationship with Microsoft's CJK Code
Pages, but when we combine the above SBCS and DBCS Code Pages into
MBCS (Multiple-Byte Character Set) Code Pages, things become a bit
more revealing:
IBM MBCS Code Page Characteristics
^^^^^^^^^^^^^^^^^^ ^^^^^^^^^^^^^^^
930 (Japanese) MBCS-Host (Code Pages 300 and 290)
932 (Japanese) MBCS-PC (Code Pages 301 and 897)
933 (Korean) MBCS-Host (Code Pages 834 and 833)
934 (Korean) MBCS-PC (Code Pages 926 and 891)
938 (Traditional Chinese) MBCS-PC (Code Pages 927 and 904)
936 (Simplified Chinese) MBCS-PC (Code Pages 928 and 903)
5031 (Simplified Chinese) MBCS-Host (Code Pages 837 and 836)
5033 (Traditional Chinese) MBCS-Host (Code Pages 835 and 37)
So, you can now see that many of Microsoft's CJK Code Pages are
derived from those established by IBM.
More detailed information on the encoding specifications for
DBCS-Host and DBCS-PC can be found in Sections 3.3.14 and 3.3.15,
respectively.
PART 4: CJK CHARACTER SET COMPATIBILITY ISSUES
The sections below provide detailed information about
compatibility issues between CJK character sets, to include tidbits of
useful information.
One thing to mention first is that conversion to and from
IBM's DBCS-Host (Section 3.3.14) and DBCS-PC (Section 3.3.15)
encodings is table-driven, and fully documented in the following IBM
publication:
o IBM Corporation. "Character Data Representation Architecture - Level
2, Registry." 1993. IBM order number SC09-1391-01.
Unfortunately, the CJK-related tables are not supplied in machine-
readable format, and must be obtained from IBM directly. The only real
compatibility issue is trying to obtain the conversion tables from
IBM.
4.1: JAPANESE
In general, when a Japanese character set was revised,
characters were simply added (usually appended at the end). However,
when JIS C 6226-1978 was revised in 1983 (to become JIS X 0208-1983),
a bit more happened (this is still a controversy).
A detailed treatment of the two main transitions, JIS C 6226-
1978 to JIS X 0208-1983 and JIS X 0208-1983 to JIS X 0208-1990, is
covered in Appendix J of UJIP. I provide machine-readable files that
detail these transitions at the following URL:
ftp://ftp.ora.com/pub/examples/nutshell/ujip/AppJ/
An interesting side note here is that there is a reason why
there are many lists that illustrate JIS C 6226-1978 and JIS X 0208-
1983 kanji form differences. While most share the same basic set of
changes, there are some inconsistencies. Well, it turns out that JIS C
6226-1978 had ten printings, and not all of them shared the same kanji
forms. If comparisons between JIS C 6226-1978 and JIS X 0208-1983 were
made using different printings of the JIS C 6226-1978 manual, the
results can differ slightly.
There are also interesting correspondences between JIS X
0208-1990 and JIS X 0212-1990. 28 kanji that vanished during the JIS C
6226-1978 to JIS X 0208-1983 transition (they were replaced by
simplified versions) were restored in JIS X 0212-1990 (at totally
different code points). Appendix J of UJIP discusses this, and a file
at the following URL details the 28 mappings:
ftp://ftp.ora.com/pub/examples/nutshell/ujip/AppJ/TJ2.jis
4.2: CHINESE (PRC)
The basic PRC standard, GB 2312-80, has been revised, but not
through a later version of the standard. Instead, the revisions were
carried out in the form of three other documents. Specifically, they
are (in order of publication):
o GB 6345.1-86 (see Section 2.2.3)
o GB 8565.2-88 (see Section 2.2.6)
o GB/T 12345-90 (see Section 2.2.7)
Unless you are aware of these documents, figuring out what has been
corrected or added to GB 2312-80 is nearly impossible.
4.3: CHINESE (TAIWAN)
The first question people think of with regard to Big Five and
CNS 11643-1992 is compatibility. It turns out that Planes 1 and 2 of
CNS 11643-1992 are more or less equivalent to Big Five, but a handful
of hanzi are in a different order. The following tables detail the
mapping from Big Five (with the ETen extension) to CNS 11643-1992
(when using this conversion table, keep in mind the encoding space
ranges for both Big Five and CNS 11643-1992):
Big Five Level 1 Correspondence to CNS 11643-1992 Plane 1:
0xA140-0xA1F5 <-> 0x2121-0x2256
0xA1F6 <-> 0x2258
0xA1F7 <-> 0x2257
0xA1F8-0xA2AE <-> 0x2259-0x234E
0xA2AF-0xA3BF <-> 0x2421-0x2570
0xA3C0-0xA3E0 <-> 0x4221-0x4241 # Symbols for control characters
0xA440-0xACFD <-> 0x4421-0x5322 # Level 1 Hanzi BEGIN
0xACFE <-> 0x5753
0xAD40-0xAFCF <-> 0x5323-0x5752
0xAFD0-0xBBC7 <-> 0x5754-0x6B4F
0xBBC8-0xBE51 <-> 0x6B51-0x6F5B
0xBE52 <-> 0x6B50
0xBE53-0xC1AA <-> 0x6F5C-0x7534
0xC1AB-0xC2CA <-> 0x7536-0x7736
0xC2CB <-> 0x7535
0xC2CC-0xC360 <-> 0x7737-0x782C
0xC361-0xC3B8 <-> 0x782E-0x7863
0xC3B9 <-> 0x7865
0xC3BA <-> 0x7864
0xC3BB-0xC455 <-> 0x7866-0x7961
0xC456 <-> 0x782D
0xC457-0xC67E <-> 0x7962-0x7D4B # Level 1 Hanzi END
0xC6A1-0xC6AA <-> 0x2621-0x262A # Circled numerals
0xC6AB-0xC6B4 <-> 0x262B-0x2634 # Parenthesized numerals
0xC6B5-0xC6BE <-> 0x2635-0x263E # Lowercase Roman numerals
0xC6BF-0xC6C0 <-> 0x2723-0x2724 # 213 radicals BEGIN
0xC6C1-0xC6C2 <-> 0x2726, 0x2728
0xC6C3-0xC6C5 <-> 0x272D-0x272F
0xC6C6-0xC6C7 <-> 0x2734, 0x2737
0xC6C8-0xC6C9 <-> 0x273A, 0x273C
0xC6CA-0xC6CB <-> 0x2742, 0x2747
0xC6CC-0xC6CD <-> 0x274E, 0x2753
0xC6CE-0xC6CF <-> 0x2754-0x2755
0xC6D0-0xC6D1 <-> 0x2759-0x275A
0xC6D2-0xC6D3 <-> 0x2761, 0x2766
0xC6D4-0xC6D5 <-> 0x2829-0x282A
0xC6D6-0xC6D7 <-> 0x2863, 0x286C # 213 radicals END
0xC6D8-0xC6E6 -> ****** # Japanese symbols
0xC6E7-0xC77A -> ****** # Hiragana
0xC77B-0xC7F2 -> ****** # Katakana
0xC7F3-0xC875 -> ****** # Cyrillic alphabet
0xC876-0xC878 -> ****** # Symbols
0xC87A -> ****** # Hanzi element
0xC87C -> ****** # Hanzi element
0xC87E-0xC8A1 -> ****** # Hanzi elements
0xC8A3-0xC8A4 -> ****** # Hanzi elements
0xC8A5-0xC8CC -> ****** # Combined numerals
0xC8CD-0xC8D3 -> ****** # Japanese symbols
Big Five Level 1 Correspondences to CNS 11643-1992 Plane 4:
0xC879 <-> 0x2123 # Hanzi element
0xC87B <-> 0x2124 # Hanzi element
0xC87D <-> 0x212A # Hanzi element
0xC8A2 <-> 0x2152 # Hanzi element
Big Five Level 2 Correspondence to CNS 11643-1992 Plane 1:
0xC94A -> 0x4442 # duplicate of 0xA461
Big Five Level 2 Correspondences to CNS 11643-1992 Plane 2:
0xC940-0xC949 <-> 0x2121-0x212A # Level 2 Hanzi BEGIN
0xC94B-0xC96B <-> 0x212B-0x214B
0xC96C-0xC9BD <-> 0x214D-0x217C
0xC9BE <-> 0x214C
0xC9BF-0xC9EC <-> 0x217D-0x224C
0xC9ED-0xCAF6 <-> 0x224E-0x2438
0xCAF7 <-> 0x224D
0xCAF8-0xD6CB <-> 0x2439-0x376E
0xD6CC <-> 0x3E63
0xD6CD-0xD779 <-> 0x3770-0x387D
0xD77A <-> 0x3F6A
0xD77B-0xDADE <-> 0x387E-0x3E62
0xDADF <-> 0x376F
0xDAE0-0xDBA6 <-> 0x3E64-0x3F69
0xDBA7-0xDDFB <-> 0x3F6B-0x4423
0xDDFC -> 0x4176 # duplicate of 0xDCD1
0xDDFD-0xE8A2 <-> 0x4424-0x554A
0xE8A3-0xE975 <-> 0x554C-0x5721
0xE976-0xEB5A <-> 0x5723-0x5A27
0xEB5B-0xEBF0 <-> 0x5A29-0x5B3E
0xEBF1 <-> 0x554B
0xEBF2-0xECDD <-> 0x5B3F-0x5C69
0xECDE <-> 0x5722
0xECDF-0xEDA9 <-> 0x5C6A-0x5D73
0xEDAA-0xEEEA <-> 0x5D75-0x6038
0xEEEB <-> 0x642F
0xEEEC-0xF055 <-> 0x6039-0x6242
0xF056 <-> 0x5D74
0xF057-0xF0CA <-> 0x6243-0x6336
0xF0CB <-> 0x5A28
0xF0CC-0xF162 <-> 0x6337-0x642E
0xF163-0xF16A <-> 0x6430-0x6437
0xF16B <-> 0x6761
0xF16C-0xF267 <-> 0x6438-0x6572
0xF268 <-> 0x6934
0xF269-0xF2C2 <-> 0x6573-0x664C
0xF2C3-0xF374 <-> 0x664E-0x6760
0xF375-0xF465 <-> 0x6762-0x6933
0xF466-0xF4B4 <-> 0x6935-0x6961
0xF4B5 <-> 0x664D
0xF4B6-0xF4FC <-> 0x6962-0x6A4A
0xF4FD-0xF662 <-> 0x6A4C-0x6C51
0xF663 <-> 0x6A4B
0xF664-0xF976 <-> 0x6C52-0x7165
0xF977-0xF9C3 <-> 0x7167-0x7233
0xF9C4 <-> 0x7166
0xF9C5 <-> 0x7234
0xF9C6 <-> 0x7240
0xF9C7-0xF9D1 <-> 0x7235-0x723F
0xF9D2-0xF9D5 <-> 0x7241-0x7244 # Level 2 Hanzi END
0xF9DD-0xF9FE -> ****** # Symbols
Big Five Level 2 Correspondence to CNS 11643-1992 Plane 3:
0xF9D6 <-> 0x4337 # ETen-specific hanzi
0xF9D7 <-> 0x4F50 # ETen-specific hanzi
0xF9D8 <-> 0x444E # ETen-specific hanzi
0xF9D9 <-> 0x504A # ETen-specific hanzi
0xF9DA <-> 0x2C5D # ETen-specific hanzi
0xF9DB <-> 0x3D7E # ETen-specific hanzi
0xF9DC <-> 0x4B5C # ETen-specific hanzi
I adapted the above from material Ross Paterson (rap@doc.ic.ac.uk)
kindly made available at the following URL:
http://www.ifcss.org:8001/www/pub/software/info/cjk-codes/
Check it out. Basically, I just changed the CNS 11643-1992 codes from
decimal row-cell values to hexadecimal codes, and corrected the
mappings to correspond to ETen's Big Five (which is considered to be
the most standard).
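
The range correspondences above can be applied mechanically. The
following Python sketch is only an illustration: the function and table
names are made up for this example, and only a few Plane 1 rows from
the tables are copied in. The one thing to keep in mind is that the
two encoding spaces differ: a Big Five lead byte has 157 valid trail
bytes (0x40-0x7E and 0xA1-0xFE), while a CNS 11643-1992 row has 94
cells, so offsets must be counted in code points, not by subtracting
the raw hexadecimal values.

```python
# Sketch: expand Big Five <-> CNS 11643-1992 range correspondences.
# Names are illustrative only; they do not come from any library.

# Valid Big Five trail bytes: 0x40-0x7E and 0xA1-0xFE (157 per lead byte).
BIG5_TRAIL = list(range(0x40, 0x7F)) + list(range(0xA1, 0xFF))

# (first Big Five code, last Big Five code, CNS plane, first CNS code).
# Only a few Plane 1 rows are copied from the tables; extend as needed.
RANGES = [
    (0xA140, 0xA1F5, 1, 0x2121),
    (0xA1F6, 0xA1F6, 1, 0x2258),
    (0xA1F7, 0xA1F7, 1, 0x2257),
    (0xA1F8, 0xA2AE, 1, 0x2259),
    (0xA2AF, 0xA3BF, 1, 0x2421),
    (0xA440, 0xACFD, 1, 0x4421),
    (0xACFE, 0xACFE, 1, 0x5753),
    (0xAD40, 0xAFCF, 1, 0x5323),
]

def big5_index(code):
    """Position of a Big Five code, counted in valid code points."""
    lead, trail = code >> 8, code & 0xFF
    return (lead - 0xA1) * 157 + BIG5_TRAIL.index(trail)

def cns_offset(code, offset):
    """Advance a CNS 11643 code by `offset` cells (94 cells per row)."""
    index = ((code >> 8) - 0x21) * 94 + ((code & 0xFF) - 0x21) + offset
    return ((index // 94 + 0x21) << 8) | (index % 94 + 0x21)

def big5_to_cns(code):
    """Return (plane, CNS code), or None if the code is not covered."""
    if (code & 0xFF) not in BIG5_TRAIL:
        return None
    for first, last, plane, cns_first in RANGES:
        if first <= code <= last:
            offset = big5_index(code) - big5_index(first)
            return plane, cns_offset(cns_first, offset)
    return None

plane, cns = big5_to_cns(0xACFD)
print(plane, hex(cns))  # 1 0x5322
```

The end points of each range make convenient self-checks: the last Big
Five code of a range should land on the last CNS code listed for it.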
It turns out that corrections were made to Big Five (at least
in the ETen and Microsoft implementations thereof) which made it a bit
closer to CNS 11643-1992 as far as character ordering is concerned.
The following seven lines of code correspondences:
0xCAF8-0xD6CB <-> 0x2439-0x376E
0xD6CC <-> 0x3E63
0xD6CD-0xD779 <-> 0x3770-0x387D
0xD77A <-> 0x3F6A
0xD77B-0xDADE <-> 0x387E-0x3E62
0xDADF <-> 0x376F
0xDAE0-0xDBA6 <-> 0x3E64-0x3F69
can now be expressed as the following three lines:
0xCAF8-0xD779 <-> 0x2439-0x387D
0xD77A <-> 0x3F6A
0xD77B-0xDBA6 <-> 0x387E-0x3F69
In essence, Big Five characters 0xD6CC and 0xDADF were swapped. This
resulted in the same order as found in CNS 11643-1992 Plane 2.
As for the two duplicate hanzi in Big Five (as indicated in
the above tables), they have been placed into a compatibility zone in
ISO 10646-1:1993 (this allows for round-trip conversion). The mapping
is as follows:
Big Five ISO 10646-1:1993
^^^^^^^^ ^^^^^^^^^^^^^^^^
0xC94A -> 0xFA0C
0xDDFC -> 0xFA0D
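
To illustrate why the compatibility zone matters for round-trip
conversion, here is a small Python sketch. The dictionary holds only
the two correspondences listed above; the function name is made up for
this example. The second half relies on Python's standard unicodedata
module and reflects how current Unicode treats these two code points
(they carry canonical mappings to the ordinary hanzi), which is an
addition of this example and not something stated above.

```python
import unicodedata

# The two Big Five duplicates and their compatibility code points.
BIG5_DUPLICATES = {0xC94A: 0xFA0C, 0xDDFC: 0xFA0D}
UCS_TO_BIG5 = {ucs: big5 for big5, ucs in BIG5_DUPLICATES.items()}

def round_trip(big5_code):
    """Big Five -> ISO 10646 -> Big Five for the two duplicate hanzi."""
    return UCS_TO_BIG5[BIG5_DUPLICATES[big5_code]]

for big5_code, ucs in BIG5_DUPLICATES.items():
    # Normalization folds the compatibility code point onto the
    # ordinary hanzi, after which the round trip is no longer possible.
    folded = unicodedata.normalize("NFC", chr(ucs))
    print(hex(big5_code), hex(ucs), "normalizes to", hex(ord(folded)))
```

In other words, the round trip survives only as long as the text is not
normalized somewhere along the way.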
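Speaking of duplicate hanzi, Plane 1 of CNS 11643-1992
contains 213 classical radicals in rows 7 through 9 (0x2721 through
0x2939). However, 187 of them map directly to hanzi code points in
Planes 1 and 2 (and naturally to Big Five hanzi); of the remaining 26,
23 map to hanzi in Plane 3 and three have no hanzi counterpart. Below
is a detailed mapping of these 213 radicals. A CNS 11643 code followed
by (2) or (3) is in Plane 2 or Plane 3, an unmarked code is in Plane 1,
and ****** means that there is no corresponding code point: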
Radical CNS 11643 Big Five Radical CNS 11643 Big Five
^^^^^^^ ^^^^^^^^^ ^^^^^^^^ ^^^^^^^ ^^^^^^^^^ ^^^^^^^^
0x2721 -> 0x4421 0xA440 0x282E -> 0x4678 0xA5D8
0x2722 -> 0x2121 (3) ****** 0x282F -> 0x4679 0xA5D9
0x2723 -> 0x2122 (3) 0xC6BF 0x2830 -> 0x467A 0xA5DA
0x2724 -> 0x2123 (3) 0xC6C0 0x2831 -> 0x467B 0xA5DB
0x2725 -> 0x4422 0xA441 0x2832 -> 0x467C 0xA5DC
0x2726 -> 0x2124 (3) 0xC6C1 0x2833 -> 0x2167 (2) 0xC9A8
0x2727 -> 0x4428 0xA447 0x2834 -> 0x467D 0xA5DD
0x2728 -> ****** 0xC6C2 0x2835 -> 0x467E 0xA5DE
0x2729 -> 0x4429 0xA448 0x2836 -> 0x4721 0xA5DF
0x272A -> 0x442A 0xA449 0x2837 -> 0x484C 0xA6CB
0x272B -> 0x442B 0xA44A 0x2838 -> 0x484D 0xA6CC
0x272C -> 0x442C 0xA44B 0x2839 -> 0x484E 0xA6CD
0x272D -> 0x2127 (3) 0xC6C3 0x283A -> 0x484F 0xA6CE
0x272E -> 0x2128 (3) 0xC6C4 0x283B -> 0x2269 (2) 0xCA49
0x272F -> ****** 0xC6C5 0x283C -> 0x4850 0xA6CF
0x2730 -> 0x442D 0xA44C 0x283D -> 0x4851 0xA6D0
0x2731 -> 0x2123 (2) 0xC942 0x283E -> 0x4852 0xA6D1
0x2732 -> 0x442E 0xA44D 0x283F -> 0x4854 0xA6D3
0x2733 -> 0x4430 0xA44F 0x2840 -> 0x4855 0xA6D4
0x2734 -> ****** 0xC6C6 0x2841 -> 0x4856 0xA6D5
0x2735 -> 0x4431 0xA450 0x2842 -> 0x4857 0xA6D6
0x2736 -> 0x2124 (2) 0xC943 0x2843 -> 0x4858 0xA6D7
0x2737 -> 0x2129 (3) 0xC6C7 0x2844 -> 0x485B 0xA6DA
0x2738 -> 0x4432 0xA451 0x2845 -> 0x485C 0xA6DB
0x2739 -> 0x4433 0xA452 0x2846 -> 0x485D 0xA6DC
0x273A -> 0x212A (3) 0xC6C8 0x2847 -> 0x485E 0xA6DD
0x273B -> 0x2125 (2) 0xC944 0x2848 -> 0x485F 0xA6DE
0x273C -> 0x212B (3) 0xC6C9 0x2849 -> 0x4860 0xA6DF
0x273D -> 0x4434 0xA453 0x284A -> 0x4861 0xA6E0
0x273E -> 0x4447 0xA466 0x284B -> 0x4862 0xA6E1
0x273F -> 0x212A (2) 0xC949 0x284C -> 0x4863 0xA6E2
0x2740 -> 0x4448 0xA467 0x284D -> 0x226A (2) 0xCA4A
0x2741 -> 0x4449 0xA468 0x284E -> 0x226F (2) 0xCA4F
0x2742 -> 0x213A (3) 0xC6CA 0x284F -> 0x4865 0xA6E4
0x2743 -> 0x444A 0xA469 0x2850 -> 0x4866 0xA6E5
0x2744 -> 0x444B 0xA46A 0x2851 -> 0x4867 0xA6E6
0x2745 -> 0x444C 0xA46B 0x2852 -> 0x4868 0xA6E7
0x2746 -> 0x444D 0xA46C 0x2853 -> 0x2270 (2) 0xCA50
0x2747 -> 0x213B (3) 0xC6CB 0x2854 -> 0x4B44 0xA8A3
0x2748 -> 0x4450 0xA46F 0x2855 -> 0x4B45 0xA8A4
0x2749 -> 0x4451 0xA470 0x2856 -> 0x4B46 0xA8A5
0x274A -> 0x4452 0xA471 0x2857 -> 0x4B47 0xA8A6
0x274B -> 0x4453 0xA472 0x2858 -> 0x4B48 0xA8A7
0x274C -> 0x212B (2) 0xC94B 0x2859 -> 0x4B49 0xA8A8
0x274D -> 0x4454 0xA473 0x285A -> 0x2524 (2) 0xCBA4
0x274E -> 0x213C (3) 0xC6CC 0x285B -> 0x4B4A 0xA8A9
0x274F -> 0x4456 0xA475 0x285C -> 0x4B4B 0xA8AA
0x2750 -> 0x4457 0xA476 0x285D -> 0x4B4C 0xA8AB
0x2751 -> 0x445A 0xA479 0x285E -> 0x4B4D 0xA8AC
0x2752 -> 0x445B 0xA47A 0x285F -> 0x4B4E 0xA8AD
0x2753 -> 0x213D (3) 0xC6CD 0x2860 -> 0x4B4F 0xA8AE
0x2754 -> 0x213E (3) 0xC6CE 0x2861 -> 0x4B50 0xA8AF
0x2755 -> 0x213F (3) 0xC6CF 0x2862 -> 0x4B51 0xA8B0
0x2756 -> 0x445C 0xA47B 0x2863 -> 0x272F (3) 0xC6D6
0x2757 -> 0x445D 0xA47C 0x2864 -> 0x4B57 0xA8B6
0x2758 -> 0x445E 0xA47D 0x2865 -> 0x4B5C 0xA8BB
0x2759 -> 0x2140 (3) 0xC6D0 0x2866 -> 0x4B5D 0xA8BC
0x275A -> 0x2142 (3) 0xC6D1 0x2867 -> 0x4B5E 0xA8BD
0x275B -> 0x212C (2) 0xC94C 0x2868 -> 0x4F5A 0xAAF7
0x275C -> 0x4540 0xA4DF 0x2869 -> 0x4F5B 0xAAF8
0x275D -> 0x4541 0xA4E0 0x286A -> 0x4F5C 0xAAF9
0x275E -> 0x4542 0xA4E1 0x286B -> 0x4F5D 0xAAFA
0x275F -> 0x4543 0xA4E2 0x286C -> 0x2A7D (3) 0xC6D7
0x2760 -> 0x4545 0xA4E4 0x286D -> 0x4F63 0xAB41
0x2761 -> 0x2167 (3) 0xC6D2 0x286E -> 0x4F64 0xAB42
0x2762 -> 0x4546 0xA4E5 0x286F -> 0x4F65 0xAB43
0x2763 -> 0x4547 0xA4E6 0x2870 -> 0x4F66 0xAB44
0x2764 -> 0x4548 0xA4E7 0x2871 -> 0x5372 0xADB1
0x2765 -> 0x4549 0xA4E8 0x2872 -> 0x5373 0xADB2
0x2766 -> 0x2169 (3) 0xC6D3 0x2873 -> 0x5374 0xADB3
0x2767 -> 0x454A 0xA4E9 0x2874 -> 0x5375 0xADB4
0x2768 -> 0x454B 0xA4EA 0x2875 -> 0x5376 0xADB5
0x2769 -> 0x454C 0xA4EB 0x2876 -> 0x5377 0xADB6
0x276A -> 0x454D 0xA4EC 0x2877 -> 0x5378 0xADB7
0x276B -> 0x454E 0xA4ED 0x2878 -> 0x5379 0xADB8
0x276C -> 0x454F 0xA4EE 0x2879 -> 0x537A 0xADB9
0x276D -> 0x4550 0xA4EF 0x287A -> 0x537B 0xADBA
0x276E -> 0x213F (2) 0xC95F 0x287B -> 0x537C 0xADBB
0x276F -> 0x4551 0xA4F0 0x287C -> 0x586B 0xB0A8
0x2770 -> 0x4552 0xA4F1 0x287D -> 0x586C 0xB0A9
0x2771 -> 0x4553 0xA4F2 0x287E -> 0x586D 0xB0AA
0x2772 -> 0x4554 0xA4F3 0x2921 -> 0x334C (2) 0xD449
0x2773 -> 0x2141 (2) 0xC961 0x2922 -> 0x586E 0xB0AB
0x2774 -> 0x4555 0xA4F4 0x2923 -> 0x334D (2) 0xD44A
0x2775 -> 0x4556 0xA4F5 0x2924 -> 0x586F 0xB0AC
0x2776 -> 0x4557 0xA4F6 0x2925 -> 0x5870 0xB0AD
0x2777 -> 0x4558 0xA4F7 0x2926 -> 0x5E23 0xB3BD
0x2778 -> 0x4559 0xA4F8 0x2927 -> 0x5E24 0xB3BE
0x2779 -> 0x2142 (2) 0xC962 0x2928 -> 0x5E25 0xB3BF
0x277A -> 0x455A 0xA4F9 0x2929 -> 0x5E26 0xB3C0
0x277B -> 0x455B 0xA4FA 0x292A -> 0x5E27 0xB3C1
0x277C -> 0x455C 0xA4FB 0x292B -> 0x5E28 0xB3C2
0x277D -> 0x455D 0xA4FC 0x292C -> 0x6327 0xB6C0
0x277E -> 0x4668 0xA5C8 0x292D -> 0x6328 0xB6C1
0x2821 -> 0x4669 0xA5C9 0x292E -> 0x6329 0xB6C2
0x2822 -> 0x466A 0xA5CA 0x292F -> 0x4155 (2) 0xDCB0
0x2823 -> 0x466B 0xA5CB 0x2930 -> 0x4875 (2) 0xE0EF
0x2824 -> 0x466C 0xA5CC 0x2931 -> 0x676F 0xB9A9
0x2825 -> 0x466D 0xA5CD 0x2932 -> 0x6770 0xB9AA
0x2826 -> 0x466E 0xA5CE 0x2933 -> 0x6771 0xB9AB
0x2827 -> 0x4670 0xA5D0 0x2934 -> 0x6B7C 0xBBF3
0x2828 -> 0x4674 0xA5D4 0x2935 -> 0x6B7D 0xBBF4
0x2829 -> 0x225B (3) 0xC6D4 0x2936 -> 0x702F 0xBEA6
0x282A -> 0x225C (3) 0xC6D5 0x2937 -> 0x733E 0xC073
0x282B -> 0x4675 0xA5D5 0x2938 -> 0x733F 0xC074
0x282C -> 0x4676 0xA5D6 0x2939 -> 0x6142 (2) 0xEFB6
0x282D -> 0x4677 0xA5D7
4.4: KOREAN
The 268 duplicate hanja in KS C 5601-1992 can cause problems
when converting to and from other CJK character sets. When converting
from KS C 5601-1992, two or more hanja can collapse into a single code
point. When converting these 268 hanja to KS C 5601-1992, a decision
about which KS C 5601-1992 code point to map to must be made. The only
exception to this is mapping to and from ISO 10646-1:1993. That
standard encodes these 268 duplicate hanja in a compatibility zone,
namely from 0xF900 through 0xFA0B.
The following is a listing of 262 hanja that map to two or
more code points (four map to three code points, and one maps to four:
a total of 268 redundantly-encoded hanja) in KS C 5601-1992:
Standard  Extra  Standard  Extra  Standard  Extra
^^^^^^^^  ^^^^^  ^^^^^^^^  ^^^^^  ^^^^^^^^  ^^^^^
0x4A39 -> 0x4D4F 0x5573 -> 0x6631 0x573C -> 0x6B29
0x4B3D -> 0x7A22 0x5574 -> 0x6633 0x573E -> 0x6B3A
0x4C38 -> 0x7A66 0x5575 -> 0x6637 0x573F -> 0x6B3B
0x4C5A -> 0x4B56 0x5576 -> 0x6638 0x5740 -> 0x6B3D
0x4C78 -> 0x5050 0x5579 -> 0x663C 0x5741 -> 0x6B41
0x4D7A -> 0x4E2D 0x557B -> 0x6646 0x5743 -> 0x6B42
0x4E29 -> 0x7C29 0x557C -> 0x6647 0x5744 -> 0x6B46
0x4F23 -> 0x4F7B 0x557E -> 0x6652 0x5745 -> 0x6B47
0x4F4F -> 0x5022 0x5621 -> 0x6656 0x5747 -> 0x6B4C
          0x5038 0x5622 -> 0x6659 0x5748 -> 0x6B4F
0x5142 -> 0x4B50 0x5623 -> 0x665F 0x5749 -> 0x6B50
0x5151 -> 0x505D 0x5624 -> 0x6661 0x574A -> 0x6B51
0x5159 -> 0x547C 0x5625 -> 0x6665 0x574C -> 0x6B58
0x5167 -> 0x552B 0x5626 -> 0x6664 0x574D -> 0x5270
0x522F -> 0x5155 0x5627 -> 0x6666 0x574E -> 0x5271
0x5233 -> 0x657C 0x5628 -> 0x6668 0x574F -> 0x5272
0x5234 -> 0x6644 0x562A -> 0x666A 0x5750 -> 0x5273
0x5235 -> 0x664A 0x562B -> 0x666B 0x5752 -> 0x5274
0x5236 -> 0x665C 0x562D -> 0x666F 0x5753 -> 0x5275
0x5237 -> 0x6676 0x562E -> 0x6671 0x5754 -> 0x5277
0x523A -> 0x6677 0x562F -> 0x6675 0x5755 -> 0x5278
0x523B -> 0x5638 0x5631 -> 0x6679 0x5757 -> 0x6C26
          0x672C 0x5633 -> 0x6721 0x5759 -> 0x6C27
0x5241 -> 0x564D 0x5634 -> 0x6726 0x575B -> 0x6C2A
0x5263 -> 0x6871 0x5635 -> 0x6729 0x575D -> 0x6C30
0x526E -> 0x6A74 0x5637 -> 0x672A 0x575E -> 0x6C31
0x526F -> 0x6B2A 0x563A -> 0x672D 0x5762 -> 0x6C35
0x527A -> 0x6C32 0x563B -> 0x6730 0x5765 -> 0x6C38
0x527B -> 0x6C49 0x563C -> 0x673F 0x5767 -> 0x6C3A
0x527C -> 0x6C4A 0x563E -> 0x6746 0x576A -> 0x6C40
0x527E -> 0x7331 0x5640 -> 0x6747 0x576B -> 0x6C41
0x5321 -> 0x552E 0x5642 -> 0x674B 0x576C -> 0x6C45
0x5358 -> 0x7738 0x5643 -> 0x674D 0x576E -> 0x6C46
0x536B -> 0x7748 0x5644 -> 0x674F 0x5770 -> 0x6C55
0x5378 -> 0x7674 0x5645 -> 0x6750 0x5772 -> 0x6C5D
0x5441 -> 0x5466 0x5647 -> 0x6753 0x5773 -> 0x6C5E
0x5457 -> 0x7753 0x5649 -> 0x675F 0x5774 -> 0x6C61
0x547A -> 0x5154 0x564A -> 0x6764 0x5776 -> 0x6C64
0x547B -> 0x5158 0x564B -> 0x6766 0x5777 -> 0x6C67
0x547D -> 0x515B 0x564C -> 0x523E 0x5778 -> 0x6C68
0x547E -> 0x515C 0x564F -> 0x5242 0x5779 -> 0x6C77
0x5521 -> 0x515D 0x5650 -> 0x5243 0x577A -> 0x6C78
0x5522 -> 0x515E 0x5653 -> 0x5244 0x577C -> 0x6C7A
0x5523 -> 0x515F 0x5654 -> 0x5246 0x5821 -> 0x6D21
0x5524 -> 0x5160 0x5655 -> 0x5247 0x5822 -> 0x6D22
0x5526 -> 0x5163 0x5656 -> 0x5248 0x5823 -> 0x6D23
0x5527 -> 0x5164 0x5657 -> 0x5249 0x5A72 -> 0x5B64
0x5528 -> 0x5165 0x5658 -> 0x524A 0x5C56 -> 0x5D25
0x552A -> 0x5166 0x565A -> 0x524B 0x5C5F -> 0x7870
0x552C -> 0x5168 0x565B -> 0x524D 0x5C74 -> 0x5D55
0x552D -> 0x5169 0x565C -> 0x524E 0x5D41 -> 0x5B45
0x552F -> 0x516A 0x565E -> 0x524F 0x5F2F -> 0x616D
0x5530 -> 0x516B 0x565F -> 0x5250 0x5F52 -> 0x6D6E
0x5531 -> 0x516D 0x5660 -> 0x5251 0x5F5D -> 0x5F61
0x5534 -> 0x516F 0x5661 -> 0x5252 0x5F63 -> 0x5E7E
0x5535 -> 0x5170 0x5662 -> 0x5253 0x6063 -> 0x612D
0x5536 -> 0x5172 0x5663 -> 0x5254           0x6672
0x5539 -> 0x5176 0x5665 -> 0x5255 0x607D -> 0x5F68
0x553D -> 0x517A 0x5666 -> 0x5256 0x6163 -> 0x574B
0x5540 -> 0x517C 0x5667 -> 0x5257           0x6B52
0x5541 -> 0x517D 0x566B -> 0x5259 0x6226 -> 0x5E7C
0x5543 -> 0x517E 0x566C -> 0x525A 0x6326 -> 0x6429
0x5544 -> 0x5222 0x566F -> 0x525E 0x635B -> 0x723D
0x5545 -> 0x5223 0x5670 -> 0x525F 0x6427 -> 0x727A
0x5546 -> 0x5227 0x5671 -> 0x5261 0x6442 -> 0x6777
0x5547 -> 0x5228 0x5674 -> 0x5262 0x6445 -> 0x5162
0x5548 -> 0x5229 0x5675 -> 0x6867           0x5525
0x5549 -> 0x522A 0x5676 -> 0x6868           0x6879
0x554D -> 0x522B 0x5677 -> 0x6870 0x6534 -> 0x652E
0x554E -> 0x522D 0x5679 -> 0x6877 0x6636 -> 0x6C2F
0x5552 -> 0x5232 0x567A -> 0x687B 0x6728 -> 0x6071
0x5553 -> 0x6531 0x567B -> 0x687E 0x6856 -> 0x6A41
0x5554 -> 0x6532 0x567E -> 0x6927 0x6C36 -> 0x5764
0x5555 -> 0x6539 0x5721 -> 0x692C 0x6C56 -> 0x666C
0x5557 -> 0x653B 0x5723 -> 0x694C 0x6D29 -> 0x7427
0x5558 -> 0x653C 0x5724 -> 0x5264 0x6D33 -> 0x6E5B
0x5559 -> 0x6544 0x5726 -> 0x5265 0x6F37 -> 0x746E
0x555D -> 0x654E 0x5727 -> 0x5266 0x7263 -> 0x6375
0x555E -> 0x6550 0x5728 -> 0x5267 0x7333 -> 0x4B67
0x555F -> 0x6552 0x5729 -> 0x5268 0x7351 -> 0x5F33
0x5561 -> 0x6556 0x572B -> 0x5269 0x742C -> 0x7676
0x5564 -> 0x657A 0x572C -> 0x526A 0x7658 -> 0x6421
0x5565 -> 0x657B 0x5730 -> 0x526B 0x7835 -> 0x5C25
0x5566 -> 0x657E 0x5731 -> 0x6A65 0x786C -> 0x785B
0x5569 -> 0x6621 0x5733 -> 0x6A77 0x7932 -> 0x5D74
0x556B -> 0x6624 0x5735 -> 0x6A7C 0x7A3C -> 0x7A21
0x556C -> 0x6627 0x5736 -> 0x6A7E 0x7B29 -> 0x6741
0x556F -> 0x662D 0x5738 -> 0x6B24 0x7C41 -> 0x4D68
0x5571 -> 0x662F 0x573A -> 0x6B27 0x7D3B -> 0x6977
0x5572 -> 0x6630
The above table represents a weekend of my time (but time well spent,
in my opinion).
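
When converting from KS C 5601-1992 to a character set that has no
duplicates, a table like the one above is typically used in reverse:
every "extra" code point is collapsed onto its "standard" code point.
The Python sketch below shows the idea with only four rows copied from
the table; the names are made up for this example.

```python
# A few rows copied from the table above: standard -> extra code points.
DUPLICATES = {
    0x4A39: [0x4D4F],
    0x4F4F: [0x5022, 0x5038],
    0x523B: [0x5638, 0x672C],
    0x6445: [0x5162, 0x5525, 0x6879],
}

# Reverse index: any extra code point -> its standard code point.
TO_STANDARD = {
    extra: standard
    for standard, extras in DUPLICATES.items()
    for extra in extras
}

def canonical_hanja(code):
    """Collapse a redundantly encoded hanja onto its standard code."""
    return TO_STANDARD.get(code, code)

# Size of the ISO 10646-1:1993 compatibility zone 0xF900-0xFA0B.
COMPAT_ZONE_SIZE = 0xFA0B - 0xF900 + 1

print(hex(canonical_hanja(0x6879)), COMPAT_ZONE_SIZE)  # 0x6445 268
```

Going the other way (into KS C 5601-1992) the sketch simply picks the
standard code point, which is one possible answer to the "which code
point to map to" decision mentioned at the start of this section.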
4.5: ISO 10646-1:1993
The Chinese character subset of ISO 10646-1:1993
has excellent round-trip conversion capability with the various
national character sets. Those national character sets with duplicate
characters, such as KS C 5601-1992 (268 hanja) and Big Five (2 hanzi),
have corresponding code points in ISO 10646-1:1993 within
a compatibility zone. See Sections 4.3 and 4.4 for more details.
Other issues regarding ISO 10646-1:1993 have to do with proper
character rendering (that is, how characters are displayed, printed,
or otherwise imaged). Many (sometimes) subtle character form
differences have been collapsed under ISO 10646-1:1993. Language or
locale was not one of the factors used in performing Han Unification.
This means that it is nearly impossible to create a single ISO 10646-1:
1993 font that meets the character form criteria of each of the four
CJK locales. An ISO 10646-1:1993 code point is not enough information
to render a Chinese character. If the font was specifically designed
for a single locale, it is a non-problem, but if there is any CJK
intent, text must be flagged for language or locale.
4.6: UNICODE
One of the most interesting (and major) differences between
the current three flavors of Unicode is the number and arrangement of
pre-combined hangul. The following table provides a summary of the
differences:
Unicode       Number of Pre-combined Hangul   UCS-2 Ranges
^^^^^^^       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^   ^^^^^^^^^^^^
Version 1.0   2,350 Basic Hangul              0x3400-0x3D2D
Version 1.1   2,350 Basic Hangul              0x3400-0x3D2D
              1,930 Supplemental Hangul A     0x3D2E-0x44B7
              2,376 Supplemental Hangul B     0x44B8-0x4DFF
Version 2.0   11,172 Hangul                   0xAC00-0xD7A3
Of the above three versions, the most controversial is Version 2.0.
Why? Because its hangul block is located in the user-defined range of
Unicode (O-Zone: 16,384 code points in 0xA000-0xDFFF), and occupies
approximately two-thirds of that space.
The information in the above table is courtesy of the
following useful document:
ftp://unicode.org/pub/MappingTables/EastAsiaMaps/Hangul-Codes.txt
The same file is also mirrored at the following URL:
ftp://ftp.ora.com/pub/examples/nutshell/ujip/map/hangul-codes.txt
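
The Version 2.0 figures can be checked with a little arithmetic. The
Python sketch below assumes the usual algorithmic arrangement of the
11,172 hangul (19 initial consonants x 21 vowels x 28 finals, where
the 28th "final" stands for no final consonant); that arrangement
comes from the Unicode Standard itself rather than from the table
above, and the function name is made up for this example.

```python
HANGUL_BASE = 0xAC00
LEADS, VOWELS, TAILS = 19, 21, 28  # 28 = 27 final consonants + "none"

def hangul_code(lead, vowel, tail=0):
    """Code point of the hangul built from zero-based jamo indexes."""
    return HANGUL_BASE + (lead * VOWELS + vowel) * TAILS + tail

count = LEADS * VOWELS * TAILS
last = hangul_code(LEADS - 1, VOWELS - 1, TAILS - 1)
ozone = 0xDFFF - 0xA000 + 1

print(count, hex(last), ozone, round(count / ozone, 2))
# 11172 0xd7a3 16384 0.68
```

The ratio of 0.68 is the "approximately two-thirds" mentioned above.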
4.7: CODE CONVERSION TIPS
There are two types of conversions that can be performed. The
first type is converting between different encodings for the same
character set. This usually works without problems (but not always). The
second type is converting from one character set to another (it is not
usually relevant whether the underlying encoding has changed or not).
This usually involves the handling of characters that are in one
character set, but not the other. So, what to do?
I suggest JConv for handling Japanese code conversion (this
means converting between JIS, Shift-JIS, and EUC encodings). This is
in the category of different encodings for the same character set. The
following URLs provide executables or source code:
ftp://ftp.ora.com/pub/examples/nutshell/ujip/mac/jconv-30.hqx
ftp://ftp.ora.com/pub/examples/nutshell/ujip/mac/jconv-dd-181.hqx
ftp://ftp.ora.com/pub/examples/nutshell/ujip/dos/jconv.exe
ftp://ftp.ora.com/pub/examples/nutshell/ujip/src/jconv.c
There are other programs available that do the same basic thing as
JConv, such as kc and nkf. They are available at the following URL:
ftp://ftp.ora.com/pub/examples/nutshell/ujip/unix/
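
For readers who just want to see what this first type of conversion
looks like, here is a minimal Python sketch. It is not how JConv, kc,
or nkf work internally; it simply relies on the codecs that ship with
Python ("euc_jp", "shift_jis", and "iso2022_jp"), and the helper name
is made up for this example.

```python
def convert(data, source, target):
    """Re-encode bytes from one Japanese encoding to another."""
    return data.decode(source).encode(target)

# Three kanji written with escapes so that this file stays 7-bit.
text = "\u65e5\u672c\u8a9e"

euc = text.encode("euc_jp")
sjis = convert(euc, "euc_jp", "shift_jis")
jis = convert(sjis, "shift_jis", "iso2022_jp")

print(euc.hex())   # EUC: every byte has its 8th bit set
print(sjis.hex())  # Shift-JIS
print(jis)         # 7-bit ISO 2022 ("JIS"), wrapped in escape sequences
```

Because all three encodings represent the same character set, nothing
is lost in either direction for characters in JIS X 0208.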
For software and tables that handle Chinese code conversion
(this includes conversion to and from Japanese), I suggest browsing at
the following URLs:
ftp://etlport.etl.go.jp/pub/iso-2022-cn/convert/
ftp://ftp.ifcss.org/pub/software/dos/convert/
ftp://ftp.ifcss.org/pub/software/mac/convert/
ftp://ftp.ifcss.org/pub/software/ms-win/convert/
ftp://ftp.ifcss.org/pub/software/unix/convert/
ftp://ftp.ifcss.org/pub/software/vms/convert/
ftp://ftp.net.tsinghua.edu.cn/pub/Chinese/convert/
ftp://ftp.ora.com/pub/examples/nutshell/ujip/map/
ftp://ftp.seed.net.tw/Pub/Chinese/DOS/code-convert/
http://www.yajima.kuis.kyoto-u.ac.jp/staffs/yasuoka/CJK.html
The last URL has FTP links to tables created by Koichi Yasuoka
(yasuoka@kudpc.kyoto-u.ac.jp).
The following URLs provide utilities or tables for converting
between various Korean encodings (the last two represent the same file):
ftp://cair-archive.kaist.ac.kr/pub/hangul/code/
ftp://ftp.ora.com/pub/examples/nutshell/ujip/map/non-hangul-codes.txt
ftp://ftp.ora.com/pub/examples/nutshell/ujip/map/hangul-codes.txt
ftp://unicode.org/pub/MappingTables/EastAsiaMaps/Hangul-Codes.txt
A popular Korean code conversion utility seems to be "hcode" by
June-Yub Lee (jylee@cims.nyu.edu).
Finally, the following URLs provide many Unicode- and CJK-
related mapping tables:
ftp://ftp.ora.com/pub/examples/nutshell/ujip/map/
ftp://ftp.ora.com/pub/examples/nutshell/ujip/unicode/
ftp://unicode.org/pub/MappingTables/
http://www.yajima.kuis.kyoto-u.ac.jp/staffs/yasuoka/CJK.html
Note that the official and authoritative Unicode mapping tables (from
Unicode values to various international, national and vendor
standards) are maintained by the Unicode Consortium at the following
URL:
ftp://unicode.org/pub/MappingTables/
Version 2.0 of "The Unicode Standard" (to be published by Addison-
Wesley shortly) will include these mapping tables on CD-ROM.
PART 5: CJK-CAPABLE OPERATING SYSTEMS
The first step in being able to display CJK text is to obtain
an operating system that handles such text (or an application that
sets up its own CJK-capable environment). Below I describe how
different types of machines can handle CJK text.
Actually, for the first few releases of CJK.INF, these
subsections will be far from complete (some may even be empty!). The
purpose of CJK.INF is to provide detailed information on character set
standards and encoding systems, so I therefore consider this sort of
information secondary.
5.1: MS-DOS
I am not aware of any CJK-capable MS-DOS operating system, but
localized versions do exist. CJK support has been introduced with
Microsoft's Windows operating system (see Section 5.2).
5.2: WINDOWS
Microsoft has CJK versions of its Windows operating system
available. The latest versions of their Windows operating system are
called Windows 95 and Windows NT. Windows 95 supports the same
character sets and encodings as in Windows Version 3.1 -- Windows NT
supports Unicode (ISO 10646-1:1993). Contact Microsoft Corporation for
more details. The URL of their WWW Home Page is:
http://www.microsoft.com/
Nadine Kano's "Developing International Software for Windows 95 and
Windows NT" provides abundant reference material for how CJK is
supported in Windows 95 and Windows NT. Check it out.
TwinBridge is a package that adds CJK functionality to non-CJK
Windows. Demo versions of TwinBridge for Japanese and Chinese are at
the following URLs:
ftp://ftp.netcom.com/pub/tw/twinbrg/Japanese/demo/tbjdemo.zip
ftp://ftp.netcom.com/pub/tw/twinbrg/Chinese/demo/tbcdemo.zip
Another useful CJK add-on for Windows 95 is NJWIN (see Section
7.10) by Hongbo Data Systems.
5.3: MACINTOSH
Macintosh is well-known as a computer that was designed to
handle multilingual texts. There are currently fully-localized
operating systems available for Japanese (KanjiTalk), Chinese
(simplified and traditional available), and Korean (HangulTalk). In
addition, Apple has developed "Language Kits" (*LK) for Chinese (CLK)
and Japanese (JLK). A Korean Language Kit (KLK) will be released
shortly.
These localized operating systems can usually be installed
together in order to make your system CJK-capable.
The common portion of these CJK-capable operating systems is a
technology Apple calls "WorldScript II" ("WorldScript I" is for one-
byte scripts). It provides the basic one- and two-byte functionality.
5.4: UNIX AND X WINDOWS
The typical encoding system used on UNIX and X Windows is EUC
(see Section 3.2). Many systems, such as IBM's AIX, can be configured
to handle both EUC and Shift-JIS (for Japanese). In addition, X11R6 (X
Window System, Version 11, Release 6) has many CJK-capable features.
If you have a fast PC and a good amount of RAM (more than
4MB), you should consider replacing MS-DOS (and Microsoft Windows,
too, if you have it) with Linux, which is a full-blown UNIX operating
system that runs on Intel processors. You can even run X Windows
(X11R6). "Running Linux" by Matt Welsh and Lar Kaufman is an excellent
guide to installing and using Linux. The companion volume, "Linux
Network Administrator's Guide" by Olaf Kirch is also useful. Because
there is a fine line -- or no line at all -- between a user and System
Administrator when using Linux, "Essential System Administration"
Second Edition by AEleen Frisch is a must-have.
Linux and Linux information are available at the following
URLs:
ftp://sunsite.unc.edu/pub/Linux/
http://sunsite.unc.edu/mdw/linux.html
I personally use Linux, and find it quite useful and powerful. My bias
comes from being a UNIX user. But, you can't beat the price (free),
and all of my favorite text-manipulation tools (such as Perl) are
readily available.
5.5: OTHERS
No information yet.
PART 6: CJK TEXT AND INTERNET SERVICES
Part 5 described how CJK text is handled on a machine
internally, but this part goes into the implications of handling such
text externally, namely for information interchange purposes. This
boils down to handling CJK text on Internet services.
For more detailed information on how these and other Internet
services are used, I suggest "The Whole Internet User's Guide &
Catalog" by Ed Krol. For more information on setting up and
maintaining these and other Internet services, I suggest "Managing
Internet Information Services" by Cricket Liu et al.
6.1: ELECTRONIC MAIL
The most basic Internet service is electronic mail (henceforth
to be called "e-mail"), which is virtually guaranteed to be available
to all users regardless of their system.
Several Internet standards (called RFCs, short for Request For
Comments) have been developed to describe how CJK text is to be handled
over e-mail systems (see Section A.3.4).
The bottom line is that most e-mail systems do not support
8-bit characters (that is, bytes that have their 8th bit set). Some do
offer 8-bit support, but you can never know what path your e-mail
might take en route to its recipient. This means that 7-bit ISO
2022 (or equivalent) is the ideal encoding to use when sending CJK
text through e-mail. If your operating system processes another
encoding system, you must convert from that encoding to one that is
compatible with 7-bit ISO 2022.
However, even 7-bit ISO 2022 encoding can get mangled by
mail-routing software -- the escape character, sometimes even part of
the escape sequence (meaning more than just the escape character), is
stripped. The JConv tool described in Section 4.7 restores stripped
escape sequences for Japanese 7-bit ISO 2022.
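
The repair idea can be sketched in a few lines of Python. This is not
JConv's algorithm, just a simplified illustration under stated
assumptions: only the escape character itself was stripped, and only
the four three-byte sequences ESC $ @, ESC $ B, ESC ( B, and ESC ( J
occur. Ordinary ASCII text that happens to contain "$B" would fool it,
so treat it as a heuristic.

```python
ESC = b"\x1b"
TWO_BYTE_IN = (b"$@", b"$B")  # shift to JIS C 6226-1978 / JIS X 0208-1983
ONE_BYTE_IN = (b"(B", b"(J")  # shift back to ASCII / JIS-Roman

def restore_escapes(data):
    """Re-insert escape characters stripped from 7-bit ISO 2022 text."""
    out = bytearray()
    i = 0
    two_byte = False
    while i < len(data):
        if data[i:i + 1] == ESC:
            # The escape character survived: copy the whole sequence.
            seq = data[i + 1:i + 3]
            out += data[i:i + 3]
            if seq in TWO_BYTE_IN:
                two_byte = True
            elif seq in ONE_BYTE_IN:
                two_byte = False
            i += 3
            continue
        pair = data[i:i + 2]
        if not two_byte and pair in TWO_BYTE_IN:
            out += ESC + pair      # stripped shift into two-byte mode
            two_byte = True
            i += 2
        elif two_byte and pair in ONE_BYTE_IN:
            out += ESC + pair      # stripped shift back to one-byte mode
            two_byte = False
            i += 2
        elif two_byte:
            out += pair            # one two-byte character
            i += 2
        else:
            out += pair[:1]        # one one-byte character
            i += 1
    return bytes(out)

original = "abc \u65e5\u672c\u8a9e def".encode("iso2022_jp")
damaged = original.replace(ESC, b"")
print(restore_escapes(damaged) == original)  # True
```

Stepping through two-byte text a pair at a time is what keeps the
second byte of one character and the first byte of the next from being
mistaken for an escape sequence.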
If your mailing software is MIME-compliant, there is a means
to identify the character set and encoding of the message using the
"charset" parameter. Some valid "charset" values include the
following:
o iso-2022-jp (see Section 3.1.3)
o iso-2022-jp-2 (see Section 3.1.3)
o iso-2022-kr (see Section 3.1.4)
o iso-2022-cn (see Section 3.1.5)
o iso-2022-cn-ext (see Section 3.1.5)
o iso-8859-1
Insertion of these values should happen automatically.
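
To see what a MIME-compliant mailer does on your behalf, the following
Python sketch builds a message with the standard email package. The
helper name and the subject line are made up for this example; the
"charset" value is one of those listed above.

```python
from email.message import EmailMessage

def build_message(body, charset="iso-2022-jp"):
    """Build a MIME message whose body is labeled with `charset`."""
    msg = EmailMessage()
    msg["Subject"] = "CJK test"
    msg.set_content(body, charset=charset)
    return msg

msg = build_message("\u65e5\u672c\u8a9e")
print(msg["Content-Type"])               # carries the charset parameter
print(msg["Content-Transfer-Encoding"])  # 7bit for 7-bit ISO 2022
```

Because the body is encoded as 7-bit ISO 2022, no further transfer
encoding is needed to get it through 7-bit mail paths.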
A last-ditch effort to send CJK text through e-mail is to use
uuencode or Base64 encoding (see Section 3.3.13). Base64 is something
that is usually done automatically by mailing software -- explicit
Base64 encoding is not common. The recipient must then run uudecode or
a Base64 decoder to get the original file (if such utilities are
available).
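
For the curious, explicit Base64 encoding looks like this in Python,
using the standard base64 module. The sample text (three kanji in EUC)
and the function names are only an illustration.

```python
import base64

def armor(raw):
    """Turn arbitrary 8-bit data into 7-bit-safe Base64 text."""
    return base64.b64encode(raw)

def unarmor(encoded):
    """Recover the original 8-bit data."""
    return base64.b64decode(encoded)

raw = "\u65e5\u672c\u8a9e".encode("euc_jp")  # six 8-bit bytes
encoded = armor(raw)

print(encoded)                  # b'xvzL3Ljs'
print(unarmor(encoded) == raw)  # True
```

Every three bytes of input become four printable characters, so the
data grows by about one third.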
6.2: USENET NEWS
Usenet News follows many of the same requirements as e-mail,
namely that 7-bit ISO 2022 encoding is ideal. However, some newsgroups
use specific encoding methods, such as:
alt.chinese.text (HZ encoding used for Chinese text)
alt.chinese.text.big5 (Big Five encoding used for Chinese text)
chinese.flame (UTF-7)
chinese.text.unicode (UTF-8)
Also, the newsgroups in Korean (all begin with "han.*") use EUC (EUC-
KR) because the news-handling software in Korea has been designed to
handle eight-bit characters correctly. Mailing list versions of Korean
newsgroups are likely to use ISO-2022-KR encoding.
One common problem with Usenet News is that the escape
characters used in 7-bit ISO 2022 encoding are sometimes stripped,
usually by the software used to post the article. This can be quite
annoying. There are programs available, such as JConv, that repair
such files by restoring the escape characters.
Another common problem is news readers that do not allow
escape characters to function. One simple solution is to "pipe" the
article through a display command, such as "more," "page," "less," or
"cat." This is done by typing a "pipe" character (|) followed by the
command name anywhere within the article being displayed.
6.3: GOPHER
The World-Wide Web (WWW) has almost eliminated the need for
using Gopher, so I won't discuss it here. Not that I don't appreciate
Gopher servers, but what I mean is that WWW browsing software permits
access to Gopher sites.
6.4: WORLD-WIDE WEB
First, there are two types of WWW browsers available. The most
common type is the graphics-based browser (examples include Mosaic and
Netscape). Graphics-based browsers have the unfortunate requirement of
a TCP/IP (SLIP and PPP support these protocols) connection. Lynx and
the W3 client for Emacs, which are text-based browsers, can be run
from the host computer through a standard terminal connection. They
don't display all the pretty pictures that folks put into their WWW
documents, but you get all the text (this is, in many ways, a blessing
in disguise -- transferring graphics is what slows down graphics-based
browsers the most). When the W3 client is run using Mule, it becomes a
fully CJK-capable WWW browser. Both Lynx and the W3 client for Emacs
are freely available. A Japanese-capable Lynx is available at the
following URL:
ftp://ftp.ipc.chiba-u.ac.jp/pub.asada/www/lynx/
There is also a WWW page that provides information on Japanese-capable
Lynx. Its URL is as follows:
http://www.icsd6.tj.chiba-u.ac.jp/lynx/
When WWW documents first came online, there was no method for
handling CJK character sets. This has, fortunately, changed. As of
this writing, two commercial WWW browsers support Japanese. They are
Infomosaic by Fujitsu Limited, and Netscape Navigator by Netscape
Communications Corporation (Version 1.1 added Japanese support). Both
are graphics-based browsers. The former can be ordered at the
following URL:
http://www.fujitsu.co.jp/
The latter can be found at the following URLs:
http://www.netscape.com/
ftp://ftp.netscape.com/
One can also use a delegate server to *filter* Japanese codes
to the one supported by your browser. It is also possible to
"Japanize" existing WWW browsers using assorted tools and patches.
Katsuhiko Momoi (momoi@tigger.stcloud.msus.edu) has authored an
excellent guide to Japanizing WWW browsers. Its URL is:
http://condor.stcloud.msus.edu:20020/netscape.html
I *highly* suggest reading it.
Japanese-capable WWW browsers support automatic detection of
the three Japanese encoding methods (JIS, Shift-JIS, and EUC). Hey,
but, what about support for the "C" and "K" of CJK? Attempting to
answer this question provides us an answer to another question: "What
is the best encoding method to use for CJK WWW documents?"
Encoding methods such as EUC and Shift-JIS provide for mixing
only two character sets. This is because they provide no way to *flag*
or *tag* text for locale (character set) information. Without flagging
information, it is impossible to distinguish Japanese EUC from Chinese
or Korean EUC. However, the escape sequences used in 7-bit ISO 2022
encoding explicitly provide locale information. 7-bit ISO 2022 is
ideal for static documents, which is exactly what one finds on WWW.
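
The ambiguity of untagged EUC is easy to demonstrate. The Python
sketch below decodes the same four bytes with three of Python's
built-in codecs ("euc_jp", "euc_kr", and "gb2312"); the function name
is made up for this example. All three decodings succeed, and each
yields different characters.

```python
# The same four bytes, read under three different EUC locales.
data = bytes.fromhex("c6fccbdc")

def euc_readings(raw):
    """Decode `raw` as Japanese, Korean, and Chinese (GB 2312) EUC."""
    codecs = ("euc_jp", "euc_kr", "gb2312")
    return {codec: raw.decode(codec) for codec in codecs}

for codec, text in euc_readings(data).items():
    # Print code points rather than the characters themselves.
    print(codec, [hex(ord(ch)) for ch in text])
```

Nothing in the bytes themselves says which reading was intended, which
is why a browser has to be told (or has to guess) the locale.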
My personal recommendation (for the short-term) is to compose
WWW documents (also called HTML documents; HTML stands for Hyper Text
Markup Language) using 7-bit ISO 2022 encoding. The escape sequences
themselves act as explicit flags that indicate locale. Some WWW
clients are confused by 7-bit ISO 2022 encoding, but the products by
Netscape Communications and Fujitsu Limited prove that this can work.
See the following URL for a description of this problem:
http://www.ntt.jp/japan/note-on-JP/LibWWW-patch.html
Check out the following URLs for information on and proposals
for international support for WWW:
http://www.ebt.com:8080/docs/multilingual-www.html
http://www.w3.org/hypertext/WWW/International/Overview/
There is currently an RFC in the works (called an Internet
Draft) to address the problem of internationalizing HTML by using
Unicode. It is very promising. The latest draft is available at the
following URLs:
ftp://ds.internic.net/internet-drafts/draft-ietf-html-i18n-04.txt.Z
ftp://ftp.isi.edu/internet-drafts/draft-ietf-html-i18n-04.txt
ftp://munnari.oz.au/internet-drafts/draft-ietf-html-i18n-04.txt.Z
ftp://nic.nordu.net/internet-drafts/draft-ietf-html-i18n-04.txt
Note that some have been compressed.
6.5: FILE TRANSFER TIPS
Although CJK encoding systems such as Shift-JIS and EUC make
extensive use of 8-bit bytes, that does not mean that you need to
treat the data as binary. Such files are simply to be treated as text,
and should be transferred in text mode (for example, FTP's ASCII mode,
which is also called "Type A Transfer").
When text files are transferred in binary mode (such as FTP's
BINARY mode, which is also called "Type I Transfer"), line termination
characters are left unaltered. For example, when transferring a text
file from UNIX to Macintosh, a text transfer will translate the UNIX
newline (0x0A) characters to Macintosh carriage return (0x0D)
characters, but a binary transfer will make no such modifications.
Text-style conversion is typically desired.
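The line-terminator translation performed by a text-mode transfer can
be sketched in a few lines of Python. The function name is
hypothetical. The sketch works at the byte level, which is safe
because neither Shift-JIS nor EUC uses 0x0A or 0x0D as part of a
multiple-byte character.

```python
def unix_to_mac(data: bytes) -> bytes:
    """Translate UNIX newlines (0x0A) to Macintosh carriage returns (0x0D)."""
    return data.replace(b"\n", b"\r")

text = "日本語\nテキスト\n"
for codec in ("shift_jis", "euc_jp"):
    unix_bytes = text.encode(codec)
    mac_bytes = unix_to_mac(unix_bytes)
    # The 8-bit bytes are untouched; only the line terminators change.
    assert mac_bytes.decode(codec) == text.replace("\n", "\r")
```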
The most common types of files that need to be handled as
binary include tar archives (*.tar), compressed files (*.Z, *.gz,
*.zip, *.zoo, *.lzh, and so on), and executables (*.exe, *.bin, and so
on).
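A script that automates transfers might choose the mode from the file
name. The following is only a sketch: the function name is
hypothetical, and the set of extensions is simply the list given
above, so extend it to suit your own files.

```python
import os

# Extensions taken from the list above; extend as needed.
BINARY_EXTENSIONS = {".tar", ".z", ".gz", ".zip", ".zoo", ".lzh", ".exe", ".bin"}

def ftp_transfer_type(filename: str) -> str:
    """Return "I" (binary) for archives, compressed files, and executables, else "A" (text)."""
    extension = os.path.splitext(filename)[1].lower()
    return "I" if extension in BINARY_EXTENSIONS else "A"
```

Everything else, including Shift-JIS and EUC text files, falls through
to Type A.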
PART 7: CJK TEXT HANDLING SOFTWARE
This section describes various CJK-capable software packages.
I expect this section to grow with future versions of this document. I
define "CJK-capable" as being able to support Chinese, Japanese, and
Korean text.
The descriptions I provide below are intentionally short. You
are encouraged to use the information pointers to obtain further
information or the software itself.
7.1: MULE
Mule (multilingual enhancement to GNU Emacs), written by
Kenichi Handa (handa@etl.go.jp), is the first (and only?) CJK-capable
editor for UNIX systems, and is freely available under the terms of
the GNU General Public License. Mule was developed from Nemacs
(Nihongo Emacs).
Mule is available at the following URL:
ftp://etlport.etl.go.jp/pub/mule/
Mule, beginning with Version 2.2, includes handy utilities
(any2ps and m2ps) for printing files in any of the encodings supported
by Mule (which is a lot of encodings, by the way). These programs use
BDF fonts. See the beginning of Part 2 for a list of URLs that have
CJK BDF fonts.
GNU Emacs is a fine editor, and Mule takes it several steps
further by providing multilingual support. I personally use Mule
together with SKK (for Japanese input) -- it is a superb combination.
7.2: CNPRINT
CNPRINT, developed by Yidao Cai (cai@neurophys.wisc.edu), is a
utility to print CJK text (or convert it to a PostScript file), and is
available for MS-DOS, VMS, and UNIX systems. A wide range of encoding
methods are supported by CNPRINT.
CNPRINT is available at the following URLs:
ftp://ftp.ifcss.org/pub/software/{dos,unix,vms}/print/
ftp://neurophys.wisc.edu/[public.cn]/
7.3: MASS
MASS (Multilingual Application Support Service), developed at
the National University of Singapore, is a suite of software tools
that speed and ease the development of UNIX-based CJK (actually, more
than just CJK) applications. It supports a wide variety of character
sets and encodings, including ISO 10646-1:1993 (UCS-2, UTF-7, and
UTF-8), EACC, and CCCII.
More information on MASS, including contact information for
its developers, can be found at the following URL:
http://www.iss.nus.sg/RND/MLP/Projects/MASS/MASS.html
7.4: ADOBE TYPE MANAGER (ATM)
Adobe Type Manager for Macintosh, beginning with Version 3.8,
is CJK-capable (as long as the underlying operating system is CJK-
capable). Actually, ATM generically supports CID-keyed fonts, which
are based on a newly-developed file specification for fonts with large
numbers of characters (like CJK fonts). See Section 7.9 for more
details.
ATM is very easy to obtain. It is bundled with fonts and
applications from Adobe Systems (chances are you have ATM if you
recently purchased an Adobe product). But what about Windows? The
Windows version of ATM should soon follow with identical
functionality.
7.5: MACINTOSH SOFTWARE
WorldScript II, a System Extension introduced with System 7,
provides multi-byte script handling, namely CJK support. If a
Macintosh product claims to support WorldScript II, chances are it is
CJK-capable (provided that your operating system has the necessary
extensions loaded).
The CJK encodings that are supported by WorldScript II capable
applications are the same as made available by the underlying
Macintosh operating system. No import/export of other encodings is
supported at the operating system level. You must run separate
conversion utilities for both import and export. Anyway, below are
some products that are known to be CJK capable.
Nisus Writer, written by Nisus Software, is fully CJK-capable
as long as you have the appropriate scripts installed (such as CLK for
Chinese or JLK for Japanese). A "Language Key" (read "dongle") is also
required for Chinese and Korean (and some one-byte scripts such as
Arabic and Hebrew). A demo version of Nisus Writer is available at the
following URL:
ftp://ftp.nisus-soft.com/pub/nisus/demos/
Give it a try! Updates are also available at the same FTP site. Nisus
Software can be contacted using the following e-mail address or
through their WWW page:
info@nisus-soft.com
http://www.nisus-soft.com/
I also suggest reading "The Nisus Way" by Joe Kissell. Chapter 13
provides detailed information about using Nisus Writer with
WorldScript, and the book includes a CD-ROM containing, among other
things, a trial version (expires after 90 days) of Nisus Writer and a
non-expiring version of Nisus Compact.
ClarisWorks by Claris Corporation, beginning with Version 4.0,
is compatible with WorldScript II and all Apple language kits. This
translates into full CJK support. The following URL provides a trial
version of ClarisWorks:
ftp://ftp.claris.com/pub/USA-Macintosh/Trial_Software/
The following URL has detailed information on this and other Claris
products:
http://www.claris.com/
The latest version of WordPerfect by Novell Incorporated is
also compatible with WorldScript II. The following URL has detailed
information:
http://wp.novell.com/tree.htm
7.6: MACBLUE TELNET
Although MacBlue Telnet (a modified version of NCSA Telnet) is
Macintosh software, I describe it separately because it does not
require the various Apple Language Kits or localized operating
systems. There are also input methods, adapted from cxterm (see
Section 7.7), available that cover the CJK spectrum (Japanese,
Simplified Chinese, Traditional Chinese, and Korean).
MacBlue Telnet is available at the following URL:
ftp://ftp.ifcss.org/pub/software/mac/networking/MacBlueTelnet/
Its associated CJK input methods are at the following URL:
ftp://ftp.ifcss.org/pub/software/mac/input/
7.7: CXTERM
This program, cxterm, is a CJK-capable xterm for X Windows
(works with X11R4, X11R5, and X11R6). It is based on the X11R6 xterm.
It is available at the following URL:
ftp://ftp.ifcss.org/pub/software/x-win/cxterm/
The following URL is for a program that adds Unicode
capability to cxterm:
ftp://ftp.ifcss.org/pub/software/unix/convert/hztty-2.0.tar.gz
The program at the following URL adds support for other encodings to
cxterm:
ftp://ftp.ifcss.org/pub/software/unix/convert/BeTTY-1.534.tar.gz
7.8: UW-DBM
UW-DBM, for Windows 3.1, Windows 95, and Windows NT, is a
program that allows users to handle Chinese (Big Five, GB-2312-80, or
HZ code), Japanese (Shift-JIS), and Korean (KS C 5601-1992)
simultaneously. More information on UW-DBM is available at the
following URL:
http://www.gy.com/ccd/win95/cjkw95.htm
A demo version of UW-DBM is available at the following URL:
ftp://ftp.aimnet.com/pub/users/chinabus/uwdbm40.zip
7.9: POSTSCRIPT
With the introduction of CID-keyed Font Technology, PostScript
has become fully CJK capable.
Adobe Systems has developed the following CJK character
collections for CID-keyed fonts (font developers are encouraged to
conform to these specifications):
Character Collection  CIDs    Supported Character Sets & Encodings
^^^^^^^^^^^^^^^^^^^^  ^^^^    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Adobe-GB1-1           9,897   GB 2312-80 and GB/T 12345-90; 7-bit ISO
                              2022 and EUC
Adobe-CNS1-0          14,099  Big Five (ETen extensions) and CNS
                              11643-1992 Planes 1 and 2; Big Five,
                              7-bit ISO 2022, and EUC
Adobe-Japan1-2        8,720   JIS X 0208-1990; Shift-JIS, 7-bit ISO
                              2022, and EUC
Adobe-Japan2-0        6,068   JIS X 0212-1990; 7-bit ISO 2022 and EUC
Adobe-Korea1-1        18,155  KS C 5601-1992 (Macintosh extensions
                              plus Johab); 7-bit ISO 2022, EUC, UHC,
                              and Johab
Note that Macintosh and Windows do not support any of the encodings
for Adobe-Japan2-0, thus fonts based on that specification are
unusable for those platforms.
Adobe Systems also has a few things in the works (that is,
they are either proposed or in draft form), all of which are
supplements to the above character collections (that is, they add CIDs):
Character Collection  CIDs    Supported Character Sets & Encodings
^^^^^^^^^^^^^^^^^^^^  ^^^^    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Adobe-CNS1-1          +6,018  Add CNS 11643-1992 Plane 3 support (30
                              of the 6,148 hanzi are in Adobe-CNS1-0)
To find out more about these CJK character collections or
CID-keyed font technology, contact the Adobe Developers Association.
Several CID-related documents have been published. ADA's contact
information is as follows:
Adobe Developers Association
Adobe Systems Incorporated
1585 Charleston Road
P.O. Box 7900
Mountain View, CA 94039-7900
USA
+1-415-961-4111 (phone)
+1-415-967-9231 (facsimile)
devsupp-person@adobe.com
http://www.adobe.com/Support/
Adobe Systems has recently developed the CID SDK (CID Software
Developers Kit), which is on a single CD-ROM. Contact the Adobe
Developers Association for information on obtaining a copy.
The complete CID-keyed font file specification and an overview
document are available at the following URLs (as a PostScript or PDF
[Adobe Acrobat] file, respectively):
ftp://ftp.adobe.com/pub/adobe/DeveloperSupport/TechNotes/PSfiles/
ftp://ftp.adobe.com/pub/adobe/DeveloperSupport/TechNotes/PDFfiles/
The file names (not provided above due to URL length) are:
5014.CMap_CIDFont_Spec.ps (complete CID engineering specification)
5014.CMap_CIDFont_Spec.pdf
5092.CID_Overview.ps (CID technology overview)
5092.CID_Overview.pdf
Other related files, mostly character collection specifications, are
available only in PDF format at the latter URL indicated above:
5004.AFM_Spec.pdf (Includes CID-keyed AFM specification)
5078b.pdf (Adobe-Japan1-2 character collection)
5079b.pdf (Adobe-GB1-0 character collection)
5080b.pdf (Adobe-CNS1-0 character collection)
5093b.pdf (Adobe-Korea1-0 character collection)
5094.pdf (Adobe CJK CMap file descriptions)
5097b.pdf (Adobe-Japan2-0 character collection)
If you do not have Adobe Acrobat, there is a freely-available Acrobat
Reader (for Macintosh, Windows, MS-DOS, and UNIX) at the following
URL:
ftp://ftp.adobe.com/pub/adobe/Applications/Acrobat/
I have also placed some CJK character collection materials,
including prototype Unicode (UCS-2 and UTF-8) CMap files, at the
following URL:
ftp://ftp.ora.com/pub/examples/nutshell/ujip/adobe/
A sample (Adobe-Korea1-0) CIDFont is also available at the above URL.
There is also a somewhat brief description of CID-keyed fonts
at the end of Chapter 6 in UJIP.
7.10: NJWIN
Hongbo Data Systems has recently released a ShareWare ($49 USD)
product called NJWIN whose purpose is to force the display of CJK text
in non-CJK applications running under US Windows 95. Actually, there
are two versions: full CJK and Japanese only.
NJWIN and its full description are available at the following
URL:
http://www.njstar.com.au/njstar/njwin.htm
Other (popular) URLs that carry NJWIN are as follows:
ftp://ftp.ora.com/pub/examples/nutshell/ujip/windows/
ftp://ftp.cc.monash.edu.au/pub/nihongo/
Hongbo Data Systems' e-mail address is:
hongbo@njstar.com.au
Their WWW Home Page is at the following URL:
http://www.njstar.com.au/
PART 8: CJK PROGRAMMING ISSUES
This new section describes issues related to using specific
programming languages to process CJK text.
8.1: C AND C++
At one time I used C on a regular basis for my CJK programming
needs, and released three tools for others to use: JConv, JChar, and
JCode. While these tools are specific to Japanese, they can be easily
adapted for CJK use. Their source code is available at the following
URL:
ftp://ftp.ora.com/pub/examples/nutshell/ujip/src/
I also provided several C code snippets in Chapter 7 of
UJIP. These are available in machine-readable form at the following
URL:
ftp://ftp.ora.com/pub/examples/nutshell/ujip/Ch7/
8.2: PERL
Although Perl does not have any special CJK facilities (note
that most implementations of C and C++ do not either), it provides a
powerful programming environment that is useful for many CJK-related
tasks.
The noteworthy features of Perl are associative arrays and
regular expressions. These are features not found in C or C++, and
allow one to write meaningful code in little time.
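To give a feel for how regular expressions and associative arrays
combine in CJK work, here is a minimal sketch. It is written in Python
rather than Perl, with Python's dictionary-based Counter playing the
role of an associative array. The pattern and the function name are
assumptions made for illustration, and the pattern covers EUC-JP only.

```python
import re
from collections import Counter

# One EUC-JP character: a three-byte JIS X 0212 sequence (0x8F prefix),
# a two-byte half-width katakana sequence (0x8E prefix), a two-byte
# JIS X 0208 sequence, or a single ASCII byte.
EUC_JP_CHAR = re.compile(
    rb"\x8f[\xa1-\xfe]{2}|\x8e[\xa1-\xdf]|[\xa1-\xfe]{2}|[\x00-\x7f]"
)

def count_characters(data: bytes) -> Counter:
    """Count each character in EUC-JP encoded data, one regex match at a time."""
    return Counter(
        match.group().decode("euc_jp") for match in EUC_JP_CHAR.finditer(data)
    )

counts = count_characters("日本語の本".encode("euc_jp"))
print(counts.most_common(1))  # [('本', 2)]
```

Matching whole characters rather than single bytes is the important
part: a byte-oriented search can otherwise match the second byte of
one character together with the first byte of the next.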
JPerl is an implementation of Perl that provides two-byte
support for Japanese (EUC or Shift-JIS encoding). It is not ideal
because JPerl scripts often cannot run under (non-Japanese) Perl.
If you often write programs for internal use, I suggest that
you check out Perl to see if it can offer you something. Chances are
that it can. A good place to start looking at Perl is through books
on the subject (see Section A.3.1) and at the following URL:
http://www.perl.com/
For those who like additional reading, "The Perl Journal" is
starting up, and information is at the following URL:
http://work.media.mit.edu/the_perl_journal/
8.3: JAVA
I am just starting to learn about the Java programming
language (and rightly so since my wife is Javanese!). It seems to have
a lot to offer.
The most interesting aspects of Java are:
o Built-in support for Unicode and UTF-8.
o The programmer must write code in the object-oriented paradigm.
o Provides a portable way to supply compiled code.
o Security features for Internet use.
More information on Java is at the following URLs:
http://www.gamelan.com/
http://www.javasoft.com/
Oh, Gamelan is the name of Javanese music.
Of the books about Java published thus far, the one I consider
to be the best is "Java in a Nutshell" by David Flanagan.
One programming feature of Perl that I dearly miss in Java is
regexes (regular expressions). Luckily, some kind person wrote a regex
package for Java based on Perl regexes. Information on this Java regex
package is available at the following URL:
http://www.win.net/~stevesoft/pat/
A FINAL NOTE
I hope that the information presented here will prove
useful. I would like to keep the electronic version of this document
as up-to-date as possible, and through readers' input, I am able to
do so.
Many readers will notice that I am very heavy into UNIX and
Macintosh (well, I recently got my first PC). If anyone has any
information on CJK-capable interfaces for other platforms, please feel
free to send it to me, and I will be sure to include it in the next
version of CJK.INF. Please include sources for the software or
documentation by providing addresses, phone numbers, FTP sites, and so
on.
Please do not hesitate to ask me further questions concerning
any subject presented in this document.
ACKNOWLEDGMENTS
I would like to express my deepest thanks to Kazumasa Utashiro
of Internet Initiative Japan (IIJ). He taught me how to send and
receive Japanese text using the 7-bit ISO 2022 codes back in 1989.
With his help I was able to write JAPAN.INF, my book, and this
document in order to inform others about what he has taught me plus
more.
Next, I thank all the folks at O'Reilly & Associates for
publishing UJIP. Special thanks to Tim O'Reilly for accepting the book
proposal, and to Peter Mui for guiding me through the process. I have
had nothing but good experiences with "them there fine folks."
I got to know Jack Halpern through UJIP, and he subsequently
translated it into Japanese. Many thanks to him.
I am also grateful to my employer, Adobe Systems, for letting
me work on interesting CJK-related projects. I really like what I do
here. In particular, I want to thank Dan Mills, my manager, for
putting up with me for these past four years.
Lastly, I would also like to thank the countless people who
provided comments on JAPAN.INF, UJIP, and CJK.INF. I hope that this
new document lives up to the spirit of my previous efforts.
APPENDIX A: OTHER INFORMATION SOURCES
Among the most useful types of information are pointers to
other information sources. This appendix provides just that.
A.1: USENET NEWSGROUPS AND MAILING LISTS
Appendix L of UJIP provided information on a number of mailing
lists. This section supplements that appendix with information on
other useful mailing lists, and points out which ones in UJIP are
relevant to readers of CJK.INF.
A.1.1: USENET NEWSGROUPS
The following Usenet Newsgroups typically have postings with
information relevant to issues discussed in CJK.INF (in alphabetical
order):
alt.chinese.computing
alt.chinese.text (HZ encoding used for Chinese text)
alt.chinese.text.big5 (Big Five encoding used for Chinese text)
alt.japanese.text (JIS encoding used for Japanese text)
chinese.flame (UTF-7)
chinese.text.unicode (UTF-8)
comp.lang.c
comp.lang.c++
comp.lang.java
comp.lang.perl.misc
comp.software.international
comp.std.internat
fj.editor.mule (JIS encoding used for Japanese text)
fj.kanji (JIS encoding used for Japanese text)
fj.net.infosystems.www.browsers (JIS encoding used for Japanese text)
fj.news.reader (JIS encoding used for Japanese text)
han.comp.hangul
han.sys.mac
sci.lang.japan (JIS encoding used for Japanese text)
If your local news host does not provide a feed of the fj.*
newsgroups (shame on them!), or if you do not have access to Usenet
News, you can alternatively fetch them from the following URL:
ftp://kuso.shef.ac.uk/pub/News/
The subdirectories correspond to the newsgroup name, but with the
"dots" being replaced by "slashes." For example, the "fj.binaries.mac"
newsgroup is archived in the "fj/binaries/mac" subdirectory. Many
thanks to Earl Kinmonth (jp1ek@sunc.shef.ac.uk) for this service.
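The mapping from newsgroup name to archive directory is mechanical, as
this small Python sketch shows. The function name is hypothetical, and
the trailing slash on the result is an assumption.

```python
ARCHIVE_ROOT = "ftp://kuso.shef.ac.uk/pub/News/"

def newsgroup_archive_url(newsgroup: str) -> str:
    """Map a newsgroup name to its archive directory by turning dots into slashes."""
    return ARCHIVE_ROOT + newsgroup.replace(".", "/") + "/"

print(newsgroup_archive_url("fj.binaries.mac"))
```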
There are some sites that carry full feeds of the fj.*
newsgroups, and permit public access (meaning that you can configure
your news reader to point to it). The only one I know of thus far is
as follows:
ume.cc.tsukuba.ac.jp
A.1.2: MAILING LISTS
The following are mailing lists that should interest readers
of this document (some are more active than others). The first line
after each entry indicates the address (or addresses) that can be used
for subscribing. The second line is the address for posting.
o CCNET-L MAILING LIST
listserv@uga.uga.edu (or listserv@uga)
ccnet-l@uga.uga.edu
o China Net Mailing List
majordomo@lists.mindspring.com
(See http://www.asia-net.com/ or jobs@asia-net.com)
o EASUG (East Asian Software Users Group) Mailing List
easug-request@guvax.acc.georgetown.edu
easug@guvax.acc.georgetown.edu
o EBTI-L (Electronic Buddhist Text Initiative) Mailing List
ebti-l-request@uxmail.ust.hk
ebti-l@uxmail.ust.hk
o EFJ (Electronic Frontiers Japan) Mailing List
majordomo@lists.twics.com
efj@lists.twics.com
o Hangul Mailing List (han.comp.hangul newsgroup)
majordomo@cair.kaist.ac.kr
hangul@cair.kaist.ac.kr
o INSOFT-L Mailing List
majordomo@trans2.b30.ingr.com
insoft-l@trans2.b30
o ISO 10646 Mailing List
listproc@listproc.hcf.jhu.edu
iso10646@listproc.hcf.jhu.edu
o Japan Net Mailing List
majordomo@lists.mindspring.com
(See http://www.asia-net.com/ or jobs@asia-net.com)
o KanjiTalk Mailing List
kanjitalk-request@cs15.atr-sw.atr.co.jp (or kanjitalk-request@crl.go.jp)
kanjitalk@cs15.atr-sw.atr.co.jp (or kanjitalk@crl.go.jp)
o Mac Mailing List (han.sys.mac newsgroup)
majordomo@krnic.net
mac@krnic.net
o Mule Mailing List
mule-request@etl.go.jp
mule@etl.go.jp or mule-jp@etl.go.jp
o NIHONGO Mailing List (sci.lang.japan newsgroup)
listserv@mitvma.mit.edu (or listserv@mitvma)
nihongo@mitvma.mit.edu
o Nihongo-Hiroba Mailing List
listproc@mcfeeley.cc.utexas.edu
nihongo-hiroba@mcfeeley.cc.utexas.edu
o Nisus Mailing List
listserv@dartmouth.edu
nisus@dartmouth.edu
o TLUG (Tokyo Linux User's Group) Mailing List
majordomo@lists.twics.com
tlug@lists.twics.com
o Unicode Mailing List
unicode-request@unicode.org
unicode@unicode.org
o WNN User Mailing List
wnn-user-request@wnn.astem.or.jp
wnn-user-jp@wnn.astem.or.jp
o WWW Multilingual Mailing List
www-mling-request@square.ntt.jp
www-mling@square.ntt.jp
If the name of the mailing list is part of the subscription address
(such as "easug-request"), the message body should look like this:
subscribe <your name>
Including your name is optional. If the username in the subscription
address is "listserv" or "majordomo" (these are names of mailing list
managing software), the mailing list name must appear after
"subscribe" in the message body as follows:
subscribe ccnet-l <your name>
Again, including your name is optional.
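The two rules above can be sketched as a small Python function. The
function name and its parameters are hypothetical. Addresses that fit
neither rule (for example, the "listproc" addresses in the list above)
are rejected rather than guessed at, because no rule is given for
them here.

```python
from typing import Optional

# Usernames that the text above identifies as list-managing software.
LIST_MANAGERS = {"listserv", "majordomo"}

def subscription_body(subscribe_address: str, list_name: str,
                      your_name: Optional[str] = None) -> str:
    """Build the message body for a subscription request."""
    username = subscribe_address.split("@", 1)[0].lower()
    if username in LIST_MANAGERS:
        # The list name must follow "subscribe".
        words = ["subscribe", list_name]
    elif username == list_name.lower() + "-request":
        # The list name is already part of the address.
        words = ["subscribe"]
    else:
        raise ValueError("no rule given for " + subscribe_address)
    if your_name:
        words.append(your_name)
    return " ".join(words)

print(subscription_body("listserv@uga.uga.edu", "ccnet-l"))
print(subscription_body("easug-request@guvax.acc.georgetown.edu", "easug"))
```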
The following URL has information about Japanese-related
mailing lists:
gopher://gan1.ncc.go.jp/11/INFO/mail-lists/
A.2: INTERNET RESOURCES
The Internet provides what I would consider to be the greatest
information resources of all. These can be subcategorized into FTP,
Telnet, Gopher, WWW, and e-mail.
A.2.1: USEFUL FTP SITES
Below are the URLs for useful FTP sites. The directory
specified is the recommended place from which to start poking around
for useful files.
ftp://cair-archive.kaist.ac.kr/pub/hangul/
ftp://etlport.etl.go.jp/pub/mule/
ftp://ftp.adobe.com/pub/adobe/
ftp://ftp.cc.monash.edu.au/pub/nihongo/
ftp://ftp.ifcss.org/pub/software/
ftp://ftp.ora.com/pub/examples/nutshell/ujip/
ftp://ftp.sra.co.jp/pub/
ftp://ftp.uwtc.washington.edu/pub/Japanese/
ftp://kuso.shef.ac.uk/pub/Japanese/
ftp://unicode.org/pub/
This list is expected to grow.
A.2.2: USEFUL TELNET SITES
For those who have a NIFTY-Serve account, there is now a very
convenient way to access NIFTY-Serve using telnet. The URL is as
follows:
telnet://r2.niftyserve.or.jp/
Information about what NIFTY-Serve has to offer (and how to subscribe)
can be found at the following URL:
http://www.nifty.co.jp/
Another information service with a similar access mechanism is
CompuServe, whose URL is as follows:
telnet://compuserve.com/
You will need to press the return key to get the "Host Name:" prompt,
at which time you type "cis" (just follow the menus from this point
on).
You can also do a search on fj.* newsgroup articles at the
following URL:
telnet://asahi-net.or.jp/
You login as "fj-db" once you are connected.
A.2.3: USEFUL GOPHER SITES
I am not too much of a Gopher user. There, of course, is the
following:
gopher://gopher.ora.com/
Another Gopher site provides information on Japanese-related mailing
lists:
gopher://gan1.ncc.go.jp/11/INFO/mail-lists/
If you happen to know of others, please let me know.
A.2.4: USEFUL WWW SITES
Because the World-Wide Web is a constantly changing place (and
more importantly, because I don't want to issue a new version of
this document every month!), I will maintain links to useful documents
at my WWW Home Page. Its URL is as follows:
http://jasper.ora.com/lunde/
If you cannot get to my WWW Home Page, you couldn't get to any that I
would list here anyway.
A.2.5: USEFUL MAIL SERVERS
In the past (that is, in JAPAN.INF) I included a full list of
the domains in the "jp" hierarchy. That list took up a lot of space,
and it changes very rapidly. You can now send a request to a mail
server in order to obtain the most current listing. The mail server
is:
mail-server@nic.ad.jp
The most common command is "send," and the following arguments can be
supplied to retrieve specific documents (and should be in the message
body, not on the "Subject:" line):
send help
send index
send jpnic/domain-list.txt
send jpnic/domain-list-e.txt
The first sends back a help file, the second sends back a complete
index of files that can be retrieved (use this one to see what other
useful stuff is available), and the last two send back a complete
listing of domains in the "jp" hierarchy (the last one sends it back in
English/romanized).
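A request of this kind can be composed with Python's standard email
package, as in the following sketch. The function name and the sender
address are placeholders, and actually sending the message (for
example, with smtplib) is left out.

```python
from email.message import EmailMessage

def build_request(sender: str, documents: list[str]) -> EmailMessage:
    """Build a request whose body holds one "send" command per line."""
    message = EmailMessage()
    message["From"] = sender
    message["To"] = "mail-server@nic.ad.jp"
    # Commands go in the message body, not on the "Subject:" line.
    message.set_content("\n".join("send " + name for name in documents) + "\n")
    return message

request = build_request(
    "you@example.com",  # placeholder sender address
    ["help", "index", "jpnic/domain-list-e.txt"],
)
print(request)
```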
A.3: OTHER RESOURCES
This section provides pointers to specific documentation
available electronically or in print.
A.3.1: BOOKS
There are other useful reference materials available in print
or online, in addition to the various national and international
standards mentioned throughout this document. The following are books
that I recommend for further reading or mental stimulus. (Sorry for
plugging my own books in this list, but they are relevant.)
o Clews, John. "Language Automation Worldwide: The Development of
Character Set Standards." SESAME Computer Projects. 1988. ISBN
1-870095-01-4.
o Flanagan, David. "Java in a Nutshell." O'Reilly & Associates,
Inc. 1996. ISBN 1-56592-183-6.
o Frisch, AEleen. "Essential System Administration." Second Edition.
O'Reilly & Associates, Inc. 1995. ISBN 1-56592-127-5.
o Huang, Jack & Timothy Huang. "An Introduction to Chinese, Japanese
and Korean Computing." World Scientific Computing. 1989. ISBN
9971-50-664-5.
o IBM Corporation. "Character Data Representation Architecture - Level
2, Registry." 1993. IBM order number SC09-1391-01.
o Kano, Nadine. "Developing International Software for Windows 95 and
Windows NT." Microsoft Press. 1995. ISBN 1-55615-840-8.
o Kirch, Olaf. "Linux Network Administrator's Guide." O'Reilly &
Associates, Inc. 1995. ISBN 1-56592-087-2.
o Kissell, Joe. "The Nisus Way." MIS:Press. 1996. ISBN 1-55828-455-9.
o Krol, Ed. "The Whole Internet User's Guide & Catalog." Second
Edition. O'Reilly & Associates, Inc. 1994. ISBN 1-56592-063-5.
o Liu, Cricket et al. "Managing Internet Information Services."
O'Reilly & Associates, Inc. 1994. ISBN 1-56592-062-7.
o Lunde, Ken. "Understanding Japanese Information Processing."
O'Reilly & Associates, Incorporated. 1993. ISBN 1-56592-043-0. LCCN
PL524.5.L86 1993.
o Lunde, Ken. "Nihongo Joho Shori." SOFTBANK Corporation. 1995. ISBN
4-89052-708-7.
o Luong, Tuoc V. et al. "Internationalization: Developing Software for
Global Markets." John Wiley & Sons, Incorporated. 1995. ISBN
0-471-07661-9.
o Schwartz, Randal L. "Learning Perl." O'Reilly & Associates,
Incorporated. 1993. ISBN 1-56592-042-2.
o Stallman, Richard M. "GNU Emacs Manual." Tenth edition. Free
Software Foundation. 1994. ISBN 1-882114-04-3.
o Tuthill, Bill. "Solaris International Developer's Guide." SunSoft
Press and PTR Prentice Hall. 1993. ISBN 0-13-031063-8.
o Unicode Consortium, The. "The Unicode Standard: Worldwide Character
Encoding." Version 1.0. Volume 2. Addison-Wesley. 1992. ISBN
0-201-60845-6.
o Vromans, Johan. "Perl 5 Desktop Reference." O'Reilly & Associates,
Inc. 1996. ISBN 1-56592-187-9.
o Wall, Larry & Randal L. Schwartz. "Programming Perl." O'Reilly &
Associates, Incorporated. 1991. ISBN 0-937175-64-1.
o Welsh, Matt & Lar Kaufman. "Running Linux." O'Reilly & Associates,
Inc. 1995. ISBN 1-56592-100-3.
If you want to get your hands on any of the national or
international standards mentioned in this document, I suggest the
following:
o The American National Standards Institute can provide ISO, KS, and
JIS standards. Bear in mind that ISO standards will most likely
arrive as a photocopy of the original.
ANSI
11 West 42nd Street
New York, NY 10036
USA
+1-212-642-4900 (phone)
+1-212-302-1286 (facsimile)
o The International Organization for Standardization can provide
ISO standards.
ISO
1, rue de Varembé
Case postale 56
CH-1211, Geneva 20
SWITZERLAND
+41-22-749-01-11 (phone)
+41-22-733-34-30 (facsimile)
central@isocs.iso.ch (e-mail)
http://www.iso.ch/ (WWW)
o Chinese (GB and CNS) standards are the hardest to obtain. It is
quite unfortunate.
A.3.2: MAGAZINES
o "Computing Japan," published monthly, ISSN 1340-7228,
editors@cj.gol.com.
o "MANGAJIN," published 10 times per year, ISSN 1051-8177.
o "Multilingual Communications & Computing," published bi-monthly,
ISSN 1065-7657, info@multilingual.com.
o "The Perl Journal," published quarterly, ISSN 1087-903X,
perl-journal-subscriptions@perl.com.
A.3.3: JOURNALS
o "Chinese Information Processing" (CIP), published bi-monthly, ISSN
1003-9082. (In Chinese.)
o "Computer Processing of Chinese & Oriental Languages" (CPCOL),
co-published twice a year by World Scientific Publishing and Chinese
Language Computer Society (CLCS), ISSN 0715-9048.
o "The Electronic Bodhidharma," published by the International
Research Institute for Zen (IRIZ) Buddhism, Hanazono University,
Japan. More information on the organization that publishes this
journal is available at the following URL:
http://www.iijnet.or.jp/iriz/irizhtml/irizhome.htm
A.3.4: RFCs
Many RFCs (Request For Comments) are relevant to this
document. They are:
o RFC 1341: "MIME (Multipurpose Internet Mail Extensions): Mechanisms
for Specifying and Describing the Format of Internet Message
Bodies," by Nathaniel Borenstein and Ned Freed, June 1992.
o RFC 1342: "Representation of Non-ASCII Text in Internet Message
Headers," by Keith Moore, June 1992.
o RFC 1468: "Japanese Character Encoding for Internet Messages," by
Jun Murai et al., June 1993.
o RFC 1521: "MIME (Multipurpose Internet Mail Extensions) Part One:
Mechanisms for Specifying and Describing the Format of Internet
Message Bodies," by Nathaniel Borenstein and Ned Freed, September
1993. Obsoletes RFC 1341.
o RFC 1522: "MIME (Multipurpose Internet Mail Extensions) Part Two:
Message Header Extensions for Non-ASCII Text," by Keith Moore,
September 1993. Obsoletes RFC 1342.
o RFC 1554: "ISO-2022-JP-2: Multilingual Extension of ISO-2022-JP," by
Masataka Ohta and Kenichi Handa, December 1993.
o RFC 1557: "Korean Character Encoding for Internet Messages," by
Uhhyung Choi et al., December 1993.
o RFC 1642: "UTF-7: A Mail-Safe Transformation Format of Unicode," by
David Goldsmith and Mark Davis, July 1994.
o RFC 1815: "Character Sets ISO-10646 and ISO-10646-J-1," by Masataka
Ohta, July 1995.
o RFC 1842: "ASCII Printable Characters-Based Chinese Character
Encoding for Internet Messages," by Ya-Gui Wei et al., August 1995.
o RFC 1843: "HZ - A Data Format for Exchanging Files of Arbitrarily
Mixed Chinese and ASCII Characters," by Fung Fung Lee, August 1995.
o RFC 1922: "Chinese Character Encoding for Internet Messages," by
Haifeng Zhu et al., March 1996.
These RFCs can be obtained from FTP archives that contain all RFC
documents, such as at the following URLs:
ftp://nic.ddn.mil/rfc/
ftp://ftp.uu.net/inet/rfc/
But these specific ones are mirrored at the following URL for
convenience:
ftp://ftp.ora.com/pub/examples/nutshell/ujip/Ch9/
A.3.5: FAQs
There are several FAQ (Frequently Asked Questions) files that
provide useful information. The following is a listing of some along
with their URLs:
o "Japanese Language Information" FAQ (formerly the "sci.lang.japan"
FAQ) by Rafael Santos (santos@mickey.ai.kyutech.ac.jp) at:
http://www.mickey.ai.kyutech.ac.jp/cgi-bin/japanese/
Update announcements are usually posted to the sci.lang.japan
newsgroup.
o "Programming for Internationalization" FAQ by Michael Gschwind
(mike@vlsivie.tuwien.ac.at) at:
ftp://ftp.vlsivie.tuwien.ac.at/pub/8bit/ISO-programming
Also posted to the comp.software.international newsgroup. This and
other internationalization documents are also accessible through the
following URL:
http://www.vlsivie.tuwien.ac.at/mike/i18n.html
o Three FAQs about Internet Service Providers in Japan by Taki Naruto
(tn@panix.com), Jesse Casman (jcasman@unm.edu), and Kenji Yoshida
(kenny@mb.tokyo.infoweb.or.jp), respectively, at:
http://www.panix.com/~tn/ispj.html
http://nobunaga.unm.edu/internet.html
http://cswww2.essex.ac.uk/users/whean/japan/net.html
o "Internationalization Reference List" by Eugene Dorr
(gdorr@pgh.legent.com) at:
ftp://ftp.ora.com/pub/examples/nutshell/ujip/doc/i18n-books.txt
Not really a FAQ, but quite useful because it is a very complete
listing of I18N-related books.
o "INSOFT-L Service" by Brian Tatro (btatro@tatro.com) at:
http://iquest.com/~btatro/in2.html
This includes a link to the FAQ for the INSOFT-L Mailing List (see
Section A.1.2).
o "How to Use Japanese on the Internet with a PC: From Login to WWW"
by Hideki Hirayama (sgw01623@niftyserve.or.jp) at:
ftp://ftp.ora.com/pub/examples/nutshell/ujip/faq/jpn-inet.FAQ
o "Hangul and Internet in Korea" FAQ by Jungshik Shin
(jshin@minerva.cis.yale.edu) at:
http://pantheon.cis.yale.edu/~jshin/faq/
--- END (CJK.INF VERSION 2.1 07/12/96) 185553 BYTES ---