summaryrefslogtreecommitdiffstats
path: root/doc/encoding.n
diff options
context:
space:
mode:
Diffstat (limited to 'doc/encoding.n')
-rw-r--r--doc/encoding.n95
1 files changed, 61 insertions, 34 deletions
diff --git a/doc/encoding.n b/doc/encoding.n
index 5aac181..c1dbf27 100644
--- a/doc/encoding.n
+++ b/doc/encoding.n
@@ -14,16 +14,10 @@ encoding \- Manipulate encodings
.BE
.SH INTRODUCTION
.PP
-Strings in Tcl are logically a sequence of 16-bit Unicode characters.
+Strings in Tcl are logically a sequence of Unicode characters.
These strings are represented in memory as a sequence of bytes that
-may be in one of several encodings: modified UTF\-8 (which uses 1 to 3
-bytes per character), 16-bit
-.QW Unicode
-(which uses 2 bytes per character, with an endianness that is
-dependent on the host architecture), and binary (which uses a single
-byte per character but only handles a restricted range of characters).
-Tcl does not guarantee to always use the same encoding for the same
-string.
+may be in one of several encodings: modified UTF\-8 (which uses 1 to 4
+bytes per character), or a custom encoding start as 8 bit binary data.
.PP
Different operating system interfaces or applications may generate
strings in other encodings such as Shift\-JIS. The \fBencoding\fR
@@ -34,16 +28,30 @@ formats.
Performs one of several encoding related operations, depending on
\fIoption\fR. The legal \fIoption\fRs are:
.TP
-\fBencoding convertfrom\fR ?\fIencoding\fR? \fIdata\fR
+\fBencoding convertfrom\fR ?\fB-nocomplain\fR? ?\fB-failindex var\fR?
+?\fIencoding\fR? \fIdata\fR
.
-Convert \fIdata\fR to Unicode from the specified \fIencoding\fR. The
-characters in \fIdata\fR are treated as binary data where the lower
-8-bits of each character is taken as a single byte. The resulting
-sequence of bytes is treated as a string in the specified
-\fIencoding\fR. If \fIencoding\fR is not specified, the current
+Convert \fIdata\fR to a Unicode string from the specified \fIencoding\fR. The
+characters in \fIdata\fR are 8 bit binary data. The resulting
+sequence of bytes is a string created by applying the given \fIencoding\fR
+to the data. If \fIencoding\fR is not specified, the current
system encoding is used.
+.
+The call fails on convertion errors, like an incomplete utf-8 sequence.
+The option \fB-failindex\fR is followed by a variable name. The variable
+is set to \fI-1\fR if no conversion error occured. It is set to the
+first error location in \fIdata\fR in case of a conversion error. All data
+until this error location is transformed and retured. This option may not
+be used together with \fB-nocomplain\fR.
+.
+The call does not fail on conversion errors, if the option
+\fB-nocomplain\fR is given. In this case, any error locations are replaced
+by \fB?\fR. Incomplete sequences are written verbatim to the output string.
+The purpose of this switch is to gain compatibility to prior versions of TCL.
+It is not recommended for any other usage.
.TP
-\fBencoding convertto\fR ?\fIencoding\fR? \fIstring\fR
+\fBencoding convertto\fR ?\fB-nocomplain\fR? ?\fB-failindex var\fR?
+?\fIencoding\fR? \fIstring\fR
.
Convert \fIstring\fR from Unicode to the specified \fIencoding\fR.
The result is a sequence of bytes that represents the converted
@@ -51,6 +59,21 @@ string. Each byte is stored in the lower 8-bits of a Unicode
character (indeed, the resulting string is a binary string as far as
Tcl is concerned, at least initially). If \fIencoding\fR is not
specified, the current system encoding is used.
+.
+The call fails on convertion errors, like a Unicode character not representable
+in the given \fIencoding\fR.
+.
+The option \fB-failindex\fR is followed by a variable name. The variable
+is set to \fI-1\fR if no conversion error occured. It is set to the
+first error location in \fIdata\fR in case of a conversion error. All data
+until this error location is transformed and retured. This option may not
+be used together with \fB-nocomplain\fR.
+.
+The call does not fail on conversion errors, if the option
+\fB-nocomplain\fR is given. In this case, any error locations are replaced
+by \fB?\fR. Incomplete sequences are written verbatim to the output string.
+The purpose of this switch is to gain compatibility to prior versions of TCL.
+It is not recommended for any other usage.
.TP
\fBencoding dirs\fR ?\fIdirectoryList\fR?
.
@@ -81,31 +104,35 @@ omitted then the command returns the current system encoding. The
system encoding is used whenever Tcl passes strings to system calls.
.SH EXAMPLE
.PP
-It is common practice to write script files using a text editor that
-produces output in the euc-jp encoding, which represents the ASCII
-characters as singe bytes and Japanese characters as two bytes. This
-makes it easy to embed literal strings that correspond to non-ASCII
-characters by simply typing the strings in place in the script.
-However, because the \fBsource\fR command always reads files using the
-current system encoding, Tcl will only source such files correctly
-when the encoding used to write the file is the same. This tends not
-to be true in an internationalized setting. For example, if such a
-file was sourced in North America (where the ISO8859\-1 is normally
-used), each byte in the file would be treated as a separate character
-that maps to the 00 page in Unicode. The resulting Tcl strings will
-not contain the expected Japanese characters. Instead, they will
-contain a sequence of Latin-1 characters that correspond to the bytes
-of the original string. The \fBencoding\fR command can be used to
-convert this string to the expected Japanese Unicode characters. For
-example,
+The following example converts a byte sequence in Japanese euc-jp encoding to a TCL string:
.PP
.CS
set s [\fBencoding convertfrom\fR euc-jp "\exA4\exCF"]
.CE
.PP
-would return the Unicode string
+The result is the unicode codepoint:
.QW "\eu306F" ,
which is the Hiragana letter HA.
+.PP
+The following example detects the error location in an incomplete UTF-8 sequence:
+.PP
+.CS
+% set s [\fBencoding convertfrom\fR -failindex i utf-8 "A\exC3"]
+A
+% set i
+1
+.CE
+.PP
+The following example detects the error location while transforming to ISO8859-1
+(ISO-Latin 1):
+.PP
+.CS
+% set s [\fBencoding convertto\fR -failindex i utf-8 "A\eu0141"]
+A
+% set i
+1
+.CE
+.PP
.SH "SEE ALSO"
Tcl_GetEncoding(3)
.SH KEYWORDS