diff options
Diffstat (limited to 'doc/encoding.n')
| -rw-r--r-- | doc/encoding.n | 95 |
1 files changed, 61 insertions, 34 deletions
diff --git a/doc/encoding.n b/doc/encoding.n index 5aac181..c1dbf27 100644 --- a/doc/encoding.n +++ b/doc/encoding.n @@ -14,16 +14,10 @@ encoding \- Manipulate encodings .BE .SH INTRODUCTION .PP -Strings in Tcl are logically a sequence of 16-bit Unicode characters. +Strings in Tcl are logically a sequence of Unicode characters. These strings are represented in memory as a sequence of bytes that -may be in one of several encodings: modified UTF\-8 (which uses 1 to 3 -bytes per character), 16-bit -.QW Unicode -(which uses 2 bytes per character, with an endianness that is -dependent on the host architecture), and binary (which uses a single -byte per character but only handles a restricted range of characters). -Tcl does not guarantee to always use the same encoding for the same -string. +may be in one of several encodings: modified UTF\-8 (which uses 1 to 4 +bytes per character), or a custom encoding start as 8 bit binary data. .PP Different operating system interfaces or applications may generate strings in other encodings such as Shift\-JIS. The \fBencoding\fR @@ -34,16 +28,30 @@ formats. Performs one of several encoding related operations, depending on \fIoption\fR. The legal \fIoption\fRs are: .TP -\fBencoding convertfrom\fR ?\fIencoding\fR? \fIdata\fR +\fBencoding convertfrom\fR ?\fB-nocomplain\fR? ?\fB-failindex var\fR? +?\fIencoding\fR? \fIdata\fR . -Convert \fIdata\fR to Unicode from the specified \fIencoding\fR. The -characters in \fIdata\fR are treated as binary data where the lower -8-bits of each character is taken as a single byte. The resulting -sequence of bytes is treated as a string in the specified -\fIencoding\fR. If \fIencoding\fR is not specified, the current +Convert \fIdata\fR to a Unicode string from the specified \fIencoding\fR. The +characters in \fIdata\fR are 8 bit binary data. The resulting +sequence of bytes is a string created by applying the given \fIencoding\fR +to the data. If \fIencoding\fR is not specified, the current system encoding is used. +. +The call fails on convertion errors, like an incomplete utf-8 sequence. +The option \fB-failindex\fR is followed by a variable name. The variable +is set to \fI-1\fR if no conversion error occured. It is set to the +first error location in \fIdata\fR in case of a conversion error. All data +until this error location is transformed and retured. This option may not +be used together with \fB-nocomplain\fR. +. +The call does not fail on conversion errors, if the option +\fB-nocomplain\fR is given. In this case, any error locations are replaced +by \fB?\fR. Incomplete sequences are written verbatim to the output string. +The purpose of this switch is to gain compatibility to prior versions of TCL. +It is not recommended for any other usage. .TP -\fBencoding convertto\fR ?\fIencoding\fR? \fIstring\fR +\fBencoding convertto\fR ?\fB-nocomplain\fR? ?\fB-failindex var\fR? +?\fIencoding\fR? \fIstring\fR . Convert \fIstring\fR from Unicode to the specified \fIencoding\fR. The result is a sequence of bytes that represents the converted @@ -51,6 +59,21 @@ string. Each byte is stored in the lower 8-bits of a Unicode character (indeed, the resulting string is a binary string as far as Tcl is concerned, at least initially). If \fIencoding\fR is not specified, the current system encoding is used. +. +The call fails on convertion errors, like a Unicode character not representable +in the given \fIencoding\fR. +. +The option \fB-failindex\fR is followed by a variable name. The variable +is set to \fI-1\fR if no conversion error occured. It is set to the +first error location in \fIdata\fR in case of a conversion error. All data +until this error location is transformed and retured. This option may not +be used together with \fB-nocomplain\fR. +. +The call does not fail on conversion errors, if the option +\fB-nocomplain\fR is given. In this case, any error locations are replaced +by \fB?\fR. Incomplete sequences are written verbatim to the output string. +The purpose of this switch is to gain compatibility to prior versions of TCL. +It is not recommended for any other usage. .TP \fBencoding dirs\fR ?\fIdirectoryList\fR? . @@ -81,31 +104,35 @@ omitted then the command returns the current system encoding. The system encoding is used whenever Tcl passes strings to system calls. .SH EXAMPLE .PP -It is common practice to write script files using a text editor that -produces output in the euc-jp encoding, which represents the ASCII -characters as singe bytes and Japanese characters as two bytes. This -makes it easy to embed literal strings that correspond to non-ASCII -characters by simply typing the strings in place in the script. -However, because the \fBsource\fR command always reads files using the -current system encoding, Tcl will only source such files correctly -when the encoding used to write the file is the same. This tends not -to be true in an internationalized setting. For example, if such a -file was sourced in North America (where the ISO8859\-1 is normally -used), each byte in the file would be treated as a separate character -that maps to the 00 page in Unicode. The resulting Tcl strings will -not contain the expected Japanese characters. Instead, they will -contain a sequence of Latin-1 characters that correspond to the bytes -of the original string. The \fBencoding\fR command can be used to -convert this string to the expected Japanese Unicode characters. For -example, +The following example converts a byte sequence in Japanese euc-jp encoding to a TCL string: .PP .CS set s [\fBencoding convertfrom\fR euc-jp "\exA4\exCF"] .CE .PP -would return the Unicode string +The result is the unicode codepoint: .QW "\eu306F" , which is the Hiragana letter HA. +.PP +The following example detects the error location in an incomplete UTF-8 sequence: +.PP +.CS +% set s [\fBencoding convertfrom\fR -failindex i utf-8 "A\exC3"] +A +% set i +1 +.CE +.PP +The following example detects the error location while transforming to ISO8859-1 +(ISO-Latin 1): +.PP +.CS +% set s [\fBencoding convertto\fR -failindex i utf-8 "A\eu0141"] +A +% set i +1 +.CE +.PP .SH "SEE ALSO" Tcl_GetEncoding(3) .SH KEYWORDS |
