diff options
Diffstat (limited to 'Modules/cjkcodecs/README')
-rw-r--r-- | Modules/cjkcodecs/README | 87 |
1 files changed, 87 insertions, 0 deletions
diff --git a/Modules/cjkcodecs/README b/Modules/cjkcodecs/README new file mode 100644 index 0000000..dc32574 --- /dev/null +++ b/Modules/cjkcodecs/README @@ -0,0 +1,87 @@ +Notes on cjkcodecs +------------------- +This directory contains source files for cjkcodecs extension modules. +They are based on CJKCodecs (http://cjkpython.i18n.org/#CJKCodecs) +as of Jan 17 2004 currently. + + + +To generate or modify mapping headers +------------------------------------- +Mapping headers are imported from CJKCodecs as pre-generated form. +If you need to tweak or add something on it, please look at tools/ +subdirectory of CJKCodecs' distribution. + + + +Notes on implmentation characteristics of each codecs +----------------------------------------------------- + +1) Big5 codec + + The big5 codec maps the following characters as cp950 does rather + than conforming Unicode.org's that maps to 0xFFFD. + + BIG5 Unicode Description + + 0xA15A 0x2574 SPACING UNDERSCORE + 0xA1C3 0xFFE3 SPACING HEAVY OVERSCORE + 0xA1C5 0x02CD SPACING HEAVY UNDERSCORE + 0xA1FE 0xFF0F LT DIAG UP RIGHT TO LOW LEFT + 0xA240 0xFF3C LT DIAG UP LEFT TO LOW RIGHT + 0xA2CC 0x5341 HANGZHOU NUMERAL TEN + 0xA2CE 0x5345 HANGZHOU NUMERAL THIRTY + + Because unicode 0x5341, 0x5345, 0xFF0F, 0xFF3C is mapped to another + big5 codes already, a roundtrip compatibility is not guaranteed for + them. + + +2) cp932 codec + + To conform to Windows's real mapping, cp932 codec maps the following + codepoints in addition of the official cp932 mapping. + + CP932 Unicode Description + + 0x80 0x80 UNDEFINED + 0xA0 0xF8F0 UNDEFINED + 0xFD 0xF8F1 UNDEFINED + 0xFE 0xF8F2 UNDEFINED + 0xFF 0xF8F3 UNDEFINED + + +3) euc-jisx0213 codec + + The euc-jisx0213 codec maps JIS X 0213 Plane 1 code 0x2140 into + unicode U+FF3C instead of U+005C as on unicode.org's mapping. + Because euc-jisx0213 has REVERSE SOLIDUS on 0x5c already and A140 + is shown as a full width character, mapping to U+FF3C can make + more sense. + + The euc-jisx0213 codec is enabled to decode JIS X 0212 codes on + codeset 2. Because JIS X 0212 and JIS X 0213 Plane 2 don't have + overlapped by each other, it doesn't bother standard conformations + (and JIS X 0213 Plane 2 is intended to use so.) On encoding + sessions, the codec will try to encode kanji characters in this + order: + + JIS X 0213 Plane 1 -> JIS X 0213 Plane 2 -> JIS X 0212 + + +4) euc-jp codec + + The euc-jp codec is a compatibility instance on these points: + - U+FF3C FULLWIDTH REVERSE SOLIDUS is mapped to EUC-JP A1C0 (vice versa) + - U+00A5 YEN SIGN is mapped to EUC-JP 0x5c. (one way) + - U+203E OVERLINE is mapped to EUC-JP 0x7e. (one way) + + +5) shift-jis codec + + The shift-jis codec is mapping 0x20-0x7e area to U+20-U+7E directly + instead of using JIS X 0201 for compatibility. The differences are: + - U+005C REVERSE SOLIDUS is mapped to SHIFT-JIS 0x5c. + - U+007E TILDE is mapped to SHIFT-JIS 0x7e. + - U+FF3C FULL-WIDTH REVERSE SOLIDUS is mapped to SHIFT-JIS 815f. + |