diff options
Diffstat (limited to 'doc/encoding.n')
| -rw-r--r-- | doc/encoding.n | 245 |
1 files changed, 65 insertions, 180 deletions
diff --git a/doc/encoding.n b/doc/encoding.n index c881d26..1c0bfa9 100644 --- a/doc/encoding.n +++ b/doc/encoding.n @@ -1,208 +1,93 @@ '\" -'\" Copyright (c) 1998 Scriptics Corporation. -'\" +'\" Copyright (c) 1998 by Scriptics Corporation. +'\" '\" See the file "license.terms" for information on usage and redistribution '\" of this file, and for a DISCLAIMER OF ALL WARRANTIES. -'\" +'\" .TH encoding n "8.1" Tcl "Tcl Built-In Commands" .so man.macros .BS .SH NAME -encoding \- Work with encodings +encoding \- Manipulate encodings .SH SYNOPSIS -\fBencoding \fIoperation\fR ?\fIarg arg ...\fR? +\fBencoding \fIoption\fR ?\fIarg arg ...\fR? .BE + .SH INTRODUCTION .PP -In Tcl every string is composed of Unicode values. Text may be encoded into an -encoding such as cp1252, iso8859-1, Shitf\-JIS, utf-8, utf-16, etc. Not every -Unicode vealue is encodable in every encoding, and some encodings can encode -values that are not available in Unicode. -.PP -Even though Unicode is for encoding the written texts of human languages, any -sequence of bytes can be encoded as the first 255 Unicode values. iso8859-1 an -encoding for a subset of Unicode in which each byte is a Unicode value of 255 -or less. Thus, any sequence of bytes can be considered to be a Unicode string -encoded in iso8859-1. To work with binary data in Tcl, decode it from -iso8859-1 when reading it in, and encode it into iso8859-1 when writing it out, -ensuring that each character in the string has a value of 255 or less. -Decoding such a string does nothing, and encoding encoding such a string also -does nothing. -.PP -For example, the following is true: -.CS -set text {In Tcl binary data is treated as Unicode text and it just works.} -set encoded [encoding convertto iso8859-1 $text] -expr {$text eq $encoded}; #-> 1 -.CE -The following is also true: -.CS -set decoded [encoding convertfrom iso8859-1 $text] -expr {$text eq $decoded}; #-> 1 -.CE +Strings in Tcl are encoded using 16-bit Unicode characters. Different +operating system interfaces or applications may generate strings in +other encodings such as Shift-JIS. The \fBencoding\fR command helps +to bridge the gap between Unicode and these other formats. .SH DESCRIPTION .PP -Performs one of the following encoding \fIoperations\fR: +Performs one of several encoding related operations, depending on +\fIoption\fR. The legal \fIoption\fRs are: .TP \fBencoding convertfrom\fR ?\fIencoding\fR? \fIdata\fR +Convert \fIdata\fR to Unicode from the specified \fIencoding\fR. The +characters in \fIdata\fR are treated as binary data where the lower +8-bits of each character is taken as a single byte. The resulting +sequence of bytes is treated as a string in the specified +\fIencoding\fR. If \fIencoding\fR is not specified, the current +system encoding is used. .TP -\fBencoding convertfrom\fR ?\fB-profile \fIprofile\fR? ?\fB-failindex var\fR? \fIencoding\fR \fIdata\fR -. -Decodes \fIdata\fR encoded in \fIencoding\fR. If \fIencoding\fR is not -specified the current system encoding is used. - -.VS "TCL8.7 TIP607, TIP656" -\fB-profile\fR determines how invalid data for the encoding are handled. See -the \fBPROFILES\fR section below for details. Returns an error if decoding -fails. However, if \fB-failindex\fR given, returns the result of the -conversion up to the point of termination, and stores in \fBvar\fR the index of -the character that could not be converted. If no errors are encountered the -entire result of the conversion is returned and the value \fB-1\fR is stored in -\fBvar\fR. -.VE "TCL8.7 TIP607, TIP656" -.TP -\fBencoding convertto\fR ?\fIencoding\fR? \fIdata\fR -.TP -\fBencoding convertto\fR ?\fB-profile \fIprofile\fR? ?\fB-failindex var\fR? \fIencoding\fR \fIdata\fR -. -Converts \fIstring\fR to \fIencoding\fR. If \fIencoding\fR is not given, the -current system encoding is used. - -.VS "TCL8.7 TIP607, TIP656" -See \fBencoding convertfrom\fR for the meaning of \fB-profile\fR and \fB-failindex\fR. -.VE "TCL8.7 TIP607, TIP656" +\fBencoding convertto\fR ?\fIencoding\fR? \fIstring\fR +Convert \fIstring\fR from Unicode to the specified \fIencoding\fR. +The result is a sequence of bytes that represents the converted +string. Each byte is stored in the lower 8-bits of a Unicode +character. If \fIencoding\fR is not specified, the current +system encoding is used. .TP \fBencoding dirs\fR ?\fIdirectoryList\fR? -. -Sets the search path for \fB*.enc\fR encoding data files to the list of -directories given by \fIdirectoryList\fR. If \fIdirectoryList\fR is not given, -returns the current list of directories that make up the search path. It is -not an error for an item in \fIdirectoryList\fR to not refer to a readable, -searchable directory. +.VS 8.5 +Tcl can load encoding data files from the file system that describe +additional encodings for it to work with. This command sets the search +path for \fB*.enc\fR encoding data files to the list of directories +\fIdirectoryList\fR. If \fIdirectoryList\fR is omitted then the +command returns the current list of directories that make up the +search path. It is an error for \fIdirectoryList\fR to not be a valid +list. If, when a search for an encoding data file is happening, an +element in \fIdirectoryList\fR does not refer to a readable, +searchable directory, that element is ignored. +.VE 8.5 .TP \fBencoding names\fR -. -Returns a list of the names of available encodings. -The encodings -.QW utf-8 -and -.QW iso8859-1 -are guaranteed to be present in the list. -.VS "TCL8.7 TIP656" -.TP -\fBencoding profiles\fR -Returns a list of names of available encoding profiles. See \fBPROFILES\fR -below. -.VE "TCL8.7 TIP656" +Returns a list containing the names of all of the encodings that are +currently available. .TP \fBencoding system\fR ?\fIencoding\fR? -. -Sets the system encoding to \fIencoding\fR. If \fIencoding\fR is not given, -returns the current system encoding. The system encoding is used to pass -strings to system calls. -.\" Do not put .VS on whole section as that messes up the bullet list alignment -.SH PROFILES -.PP -.VS "TCL8.7 TIP656" -Each \fIprofile\fR is a distinct strategy for dealing with invalid data for an -encoding. -.PP -The following profiles are currently implemented. -.VS "TCL8.7 TIP656" -.TP -\fBtcl8\fR -. -The default profile. Provides for behaviour identical to that of Tcl 8.6: When -decoding, for encodings \fBother than utf-8\fR, each invalid byte is interpreted -as the Unicode value given by that one byte. For example, the byte 0x80, which -is invalid in the ASCII encoding would be mapped to the Unicode value U+0080. -For \fButf-8\fR, each invalid byte that is a valid CP1252 character is -interpreted as the Unicode value for that character, while each byte that is -not is treated as the Unicode value given by that one byte. For example, byte -0x80 is defined by CP1252 and is therefore mapped to its Unicode equivalent -U+20AC while byte 0x81 which is not defined by CP1252 is mapped to U+0081. As -an additional special case, the sequence 0xC0 0x80 is mapped to U+0000. - -When encoding, each character that cannot be represented in the encoding is -replaced by an encoding-dependent character, usually the question mark \fB?\fR. -.TP -\fBstrict\fR -. -The operation fails when invalid data for the encoding are encountered. -.TP -\fBreplace\fR -. -When decoding, invalid bytes are replaced by U+FFFD, the Unicode REPLACEMENT -CHARACTER. - -When encoding, Unicode values that cannot be represented in the target encoding -are transformed to an encoding-specific fallback character, U+FFFD REPLACEMENT -CHARACTER for UTF targets, and generally `?` for other encodings. -.VE "TCL8.7 TIP656" -.SH EXAMPLES -.PP -These examples use the utility proc below that prints the Unicode value for -each character in a string. -.PP -.CS -proc codepoints s {join [lmap c [split $s {}] { - string cat U+ [format %.6X [scan $c %c]]}] -} -.CE -.PP -Example 1: Convert from euc-jp: -.PP +Set the system encoding to \fIencoding\fR. If \fIencoding\fR is +omitted then the command returns the current system encoding. The +system encoding is used whenever Tcl passes strings to system calls. +.SH EXAMPLE +.PP +It is common practice to write script files using a text editor that +produces output in the euc-jp encoding, which represents the ASCII +characters as singe bytes and Japanese characters as two bytes. This +makes it easy to embed literal strings that correspond to non-ASCII +characters by simply typing the strings in place in the script. +However, because the \fBsource\fR command always reads files using the +current system encoding, Tcl will only source such files correctly +when the encoding used to write the file is the same. This tends not +to be true in an internationalized setting. For example, if such a +file was sourced in North America (where the ISO8859-1 is normally +used), each byte in the file would be treated as a separate character +that maps to the 00 page in Unicode. The resulting Tcl strings will +not contain the expected Japanese characters. Instead, they will +contain a sequence of Latin-1 characters that correspond to the bytes +of the original string. The \fBencoding\fR command can be used to +convert this string to the expected Japanese Unicode characters. For +example, .CS -% codepoints [\fBencoding convertfrom\fR euc-jp \exA4\exCF] -U+00306F +set s [\fBencoding convertfrom\fR euc-jp "\exA4\exCF"] .CE -.PP -The result is the Unicode value +would return the Unicode string .QW "\eu306F" , which is the Hiragana letter HA. -.VS "TCL8.7 TIP607, TIP656" -.PP -Example 2: Error handling based on profiles: -.PP -The letter \fBA\fR is Unicode character U+0041 and the byte "\ex80" is invalid -in ASCII encoding. -.PP -.CS -% codepoints [encoding convertfrom -profile tcl8 ascii A\ex80] -U+000041 U+000080 -% codepoints [encoding convertfrom -profile replace ascii A\ex80] -U+000041 U+00FFFD -% codepoints [encoding convertfrom -profile strict ascii A\ex80] -unexpected byte sequence starting at index 1: '\ex80' -.CE -.PP -Example 3: Get partial data and the error location: -.PP -.CS -% codepoints [encoding convertfrom -profile strict -failindex idx ascii AB\ex80] -U+000041 U+000042 -% set idx -2 -.CE -.PP -Example 4: Encode a character that is not representable in ISO8859-1: -.PP -.CS -% encoding convertto iso8859-1 A\eu0141 -A? -% encoding convertto -profile strict iso8859-1 A\eu0141 -unexpected character at index 1: 'U+000141' -% encoding convertto -profile strict -failindex idx iso8859-1 A\eu0141 -A -% set idx -1 -.CE -.VE "TCL8.7 TIP607, TIP656" -.PP + .SH "SEE ALSO" -Tcl_GetEncoding(3), fconfigure(n) +Tcl_GetEncoding(3) + .SH KEYWORDS -encoding, unicode -.\" Local Variables: -.\" mode: nroff -.\" End: +encoding |
