Diffstat (limited to 'doc/encoding.n')
-rw-r--r-- doc/encoding.n 245
1 files changed, 65 insertions, 180 deletions
diff --git a/doc/encoding.n b/doc/encoding.n
index c881d26..1c0bfa9 100644
--- a/doc/encoding.n
+++ b/doc/encoding.n
@@ -1,208 +1,93 @@
'\"
-'\" Copyright (c) 1998 Scriptics Corporation.
-'\"
+'\" Copyright (c) 1998 by Scriptics Corporation.
+'\"
'\" See the file "license.terms" for information on usage and redistribution
'\" of this file, and for a DISCLAIMER OF ALL WARRANTIES.
-'\"
+'\"
.TH encoding n "8.1" Tcl "Tcl Built-In Commands"
.so man.macros
.BS
.SH NAME
-encoding \- Work with encodings
+encoding \- Manipulate encodings
.SH SYNOPSIS
-\fBencoding \fIoperation\fR ?\fIarg arg ...\fR?
+\fBencoding \fIoption\fR ?\fIarg arg ...\fR?
.BE
+
.SH INTRODUCTION
.PP
-In Tcl every string is composed of Unicode values. Text may be encoded into an
-encoding such as cp1252, iso8859-1, Shift\-JIS, utf-8, utf-16, etc. Not every
-Unicode value is encodable in every encoding, and some encodings can encode
-values that are not available in Unicode.
-.PP
-Even though Unicode is for encoding the written texts of human languages, any
-sequence of bytes can be encoded as the first 255 Unicode values. iso8859-1 is an
-encoding for a subset of Unicode in which each byte is a Unicode value of 255
-or less. Thus, any sequence of bytes can be considered to be a Unicode string
-encoded in iso8859-1. To work with binary data in Tcl, decode it from
-iso8859-1 when reading it in, and encode it into iso8859-1 when writing it out,
-ensuring that each character in the string has a value of 255 or less.
-Decoding such a string does nothing, and encoding such a string also
-does nothing.
-.PP
-For example, the following is true:
-.CS
-set text {In Tcl binary data is treated as Unicode text and it just works.}
-set encoded [encoding convertto iso8859-1 $text]
-expr {$text eq $encoded}; #-> 1
-.CE
-The following is also true:
-.CS
-set decoded [encoding convertfrom iso8859-1 $text]
-expr {$text eq $decoded}; #-> 1
-.CE
+Strings in Tcl are encoded using 16-bit Unicode characters. Different
+operating system interfaces or applications may generate strings in
+other encodings such as Shift-JIS. The \fBencoding\fR command helps
+to bridge the gap between Unicode and these other formats.
.SH DESCRIPTION
.PP
-Performs one of the following encoding \fIoperations\fR:
+Performs one of several encoding related operations, depending on
+\fIoption\fR. The legal \fIoption\fRs are:
.TP
\fBencoding convertfrom\fR ?\fIencoding\fR? \fIdata\fR
+Convert \fIdata\fR to Unicode from the specified \fIencoding\fR. The
+characters in \fIdata\fR are treated as binary data where the lower
+8 bits of each character are taken as a single byte. The resulting
+sequence of bytes is treated as a string in the specified
+\fIencoding\fR. If \fIencoding\fR is not specified, the current
+system encoding is used.
.TP
-\fBencoding convertfrom\fR ?\fB-profile \fIprofile\fR? ?\fB-failindex var\fR? \fIencoding\fR \fIdata\fR
-.
-Decodes \fIdata\fR encoded in \fIencoding\fR. If \fIencoding\fR is not
-specified the current system encoding is used.
-
-.VS "TCL8.7 TIP607, TIP656"
-\fB-profile\fR determines how invalid data for the encoding are handled. See
-the \fBPROFILES\fR section below for details. Returns an error if decoding
-fails. However, if \fB-failindex\fR is given, returns the result of the
-conversion up to the point of termination, and stores in \fBvar\fR the index of
-the character that could not be converted. If no errors are encountered the
-entire result of the conversion is returned and the value \fB-1\fR is stored in
-\fBvar\fR.
-.VE "TCL8.7 TIP607, TIP656"
-.TP
-\fBencoding convertto\fR ?\fIencoding\fR? \fIdata\fR
-.TP
-\fBencoding convertto\fR ?\fB-profile \fIprofile\fR? ?\fB-failindex var\fR? \fIencoding\fR \fIdata\fR
-.
-Converts \fIstring\fR to \fIencoding\fR. If \fIencoding\fR is not given, the
-current system encoding is used.
-
-.VS "TCL8.7 TIP607, TIP656"
-See \fBencoding convertfrom\fR for the meaning of \fB-profile\fR and \fB-failindex\fR.
-.VE "TCL8.7 TIP607, TIP656"
+\fBencoding convertto\fR ?\fIencoding\fR? \fIstring\fR
+Convert \fIstring\fR from Unicode to the specified \fIencoding\fR.
+The result is a sequence of bytes that represents the converted
+string. Each byte is stored in the lower 8-bits of a Unicode
+character. If \fIencoding\fR is not specified, the current
+system encoding is used.
.TP
\fBencoding dirs\fR ?\fIdirectoryList\fR?
-.
-Sets the search path for \fB*.enc\fR encoding data files to the list of
-directories given by \fIdirectoryList\fR. If \fIdirectoryList\fR is not given,
-returns the current list of directories that make up the search path. It is
-not an error for an item in \fIdirectoryList\fR to not refer to a readable,
-searchable directory.
+.VS 8.5
+Tcl can load encoding data files from the file system that describe
+additional encodings for it to work with. This command sets the search
+path for \fB*.enc\fR encoding data files to the list of directories
+\fIdirectoryList\fR. If \fIdirectoryList\fR is omitted then the
+command returns the current list of directories that make up the
+search path. It is an error for \fIdirectoryList\fR to not be a valid
+list. If, when a search for an encoding data file is happening, an
+element in \fIdirectoryList\fR does not refer to a readable,
+searchable directory, that element is ignored.
+.VE 8.5
.TP
\fBencoding names\fR
-.
-Returns a list of the names of available encodings.
-The encodings
-.QW utf-8
-and
-.QW iso8859-1
-are guaranteed to be present in the list.
-.VS "TCL8.7 TIP656"
-.TP
-\fBencoding profiles\fR
-Returns a list of names of available encoding profiles. See \fBPROFILES\fR
-below.
-.VE "TCL8.7 TIP656"
+Returns a list containing the names of all of the encodings that are
+currently available.
.TP
\fBencoding system\fR ?\fIencoding\fR?
-.
-Sets the system encoding to \fIencoding\fR. If \fIencoding\fR is not given,
-returns the current system encoding. The system encoding is used to pass
-strings to system calls.
-.\" Do not put .VS on whole section as that messes up the bullet list alignment
-.SH PROFILES
-.PP
-.VS "TCL8.7 TIP656"
-Each \fIprofile\fR is a distinct strategy for dealing with invalid data for an
-encoding.
-.PP
-The following profiles are currently implemented.
-.VS "TCL8.7 TIP656"
-.TP
-\fBtcl8\fR
-.
-The default profile. Provides for behaviour identical to that of Tcl 8.6: When
-decoding, for encodings \fBother than utf-8\fR, each invalid byte is interpreted
-as the Unicode value given by that one byte. For example, the byte 0x80, which
-is invalid in the ASCII encoding would be mapped to the Unicode value U+0080.
-For \fButf-8\fR, each invalid byte that is a valid CP1252 character is
-interpreted as the Unicode value for that character, while each byte that is
-not is treated as the Unicode value given by that one byte. For example, byte
-0x80 is defined by CP1252 and is therefore mapped to its Unicode equivalent
-U+20AC while byte 0x81 which is not defined by CP1252 is mapped to U+0081. As
-an additional special case, the sequence 0xC0 0x80 is mapped to U+0000.
-
-When encoding, each character that cannot be represented in the encoding is
-replaced by an encoding-dependent character, usually the question mark \fB?\fR.
-.TP
-\fBstrict\fR
-.
-The operation fails when invalid data for the encoding are encountered.
-.TP
-\fBreplace\fR
-.
-When decoding, invalid bytes are replaced by U+FFFD, the Unicode REPLACEMENT
-CHARACTER.
-
-When encoding, Unicode values that cannot be represented in the target encoding
-are transformed to an encoding-specific fallback character, U+FFFD REPLACEMENT
-CHARACTER for UTF targets, and generally `?` for other encodings.
-.VE "TCL8.7 TIP656"
-.SH EXAMPLES
-.PP
-These examples use the utility proc below that prints the Unicode value for
-each character in a string.
-.PP
-.CS
-proc codepoints s {join [lmap c [split $s {}] {
- string cat U+ [format %.6X [scan $c %c]]}]
-}
-.CE
-.PP
-Example 1: Convert from euc-jp:
-.PP
+Set the system encoding to \fIencoding\fR. If \fIencoding\fR is
+omitted then the command returns the current system encoding. The
+system encoding is used whenever Tcl passes strings to system calls.
+.SH EXAMPLE
+.PP
+It is common practice to write script files using a text editor that
+produces output in the euc-jp encoding, which represents the ASCII
+characters as single bytes and Japanese characters as two bytes. This
+makes it easy to embed literal strings that correspond to non-ASCII
+characters by simply typing the strings in place in the script.
+However, because the \fBsource\fR command always reads files using the
+current system encoding, Tcl will only source such files correctly
+when the encoding used to write the file is the same. This tends not
+to be true in an internationalized setting. For example, if such a
+file was sourced in North America (where ISO8859-1 is normally
+used), each byte in the file would be treated as a separate character
+that maps to the 00 page in Unicode. The resulting Tcl strings will
+not contain the expected Japanese characters. Instead, they will
+contain a sequence of Latin-1 characters that correspond to the bytes
+of the original string. The \fBencoding\fR command can be used to
+convert this string to the expected Japanese Unicode characters. For
+example,
.CS
-% codepoints [\fBencoding convertfrom\fR euc-jp \exA4\exCF]
-U+00306F
+set s [\fBencoding convertfrom\fR euc-jp "\exA4\exCF"]
.CE
-.PP
-The result is the Unicode value
+would return the Unicode string
.QW "\eu306F" ,
which is the Hiragana letter HA.
-.VS "TCL8.7 TIP607, TIP656"
-.PP
-Example 2: Error handling based on profiles:
-.PP
-The letter \fBA\fR is Unicode character U+0041 and the byte "\ex80" is invalid
-in ASCII encoding.
-.PP
-.CS
-% codepoints [encoding convertfrom -profile tcl8 ascii A\ex80]
-U+000041 U+000080
-% codepoints [encoding convertfrom -profile replace ascii A\ex80]
-U+000041 U+00FFFD
-% codepoints [encoding convertfrom -profile strict ascii A\ex80]
-unexpected byte sequence starting at index 1: '\ex80'
-.CE
-.PP
-Example 3: Get partial data and the error location:
-.PP
-.CS
-% codepoints [encoding convertfrom -profile strict -failindex idx ascii AB\ex80]
-U+000041 U+000042
-% set idx
-2
-.CE
-.PP
-Example 4: Encode a character that is not representable in ISO8859-1:
-.PP
-.CS
-% encoding convertto iso8859-1 A\eu0141
-A?
-% encoding convertto -profile strict iso8859-1 A\eu0141
-unexpected character at index 1: 'U+000141'
-% encoding convertto -profile strict -failindex idx iso8859-1 A\eu0141
-A
-% set idx
-1
-.CE
-.VE "TCL8.7 TIP607, TIP656"
-.PP
+
.SH "SEE ALSO"
-Tcl_GetEncoding(3), fconfigure(n)
+Tcl_GetEncoding(3)
+
.SH KEYWORDS
-encoding, unicode
-.\" Local Variables:
-.\" mode: nroff
-.\" End:
+encoding
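
The example kept on the "+" side of this diff decodes the two euc-jp bytes 0xA4 0xCF into the Hiragana letter HA. As a minimal sketch of the round trip that \fBencoding convertfrom\fR and \fBencoding convertto\fR describe, the script below decodes those bytes and encodes the result again. It assumes the euc-jp encoding is available in the running interpreter (check with \fBencoding names\fR); the variable names are illustrative only and are not part of the patch.

```tcl
# Assumption: the euc-jp encoding is available in this interpreter.
# Bytes A4 CF are the euc-jp form of HIRAGANA LETTER HA (U+306F).
set bytes "\xA4\xCF"

# Decode the byte sequence into a Tcl string.
set s [encoding convertfrom euc-jp $bytes]

# Encode the string again and render the resulting bytes as hex digits.
set back [encoding convertto euc-jp $s]
binary scan $back H* hex

# Report the decoded code point and the re-encoded bytes.
puts [format "U+%04X -> %s" [scan $s %c] $hex]
```

If the encoding is present, this should report U+306F and the original bytes a4cf, which shows that the two subcommands are inverses for data that is valid in the target encoding.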