Diffstat (limited to 'doc/encoding.n')
| -rw-r--r-- | doc/encoding.n | 90 |
1 files changed, 63 insertions, 27 deletions
diff --git a/doc/encoding.n b/doc/encoding.n
index 5fad056..5782199 100644
--- a/doc/encoding.n
+++ b/doc/encoding.n
@@ -4,30 +4,38 @@
 '\" See the file "license.terms" for information on usage and redistribution
 '\" of this file, and for a DISCLAIMER OF ALL WARRANTIES.
 '\"
-'\" RCS: @(#) $Id: encoding.n,v 1.3 2000/09/07 14:27:47 poenitz Exp $
-'\"
-.so man.macros
 .TH encoding n "8.1" Tcl "Tcl Built-In Commands"
+.so man.macros
 .BS
 .SH NAME
 encoding \- Manipulate encodings
 .SH SYNOPSIS
 \fBencoding \fIoption\fR ?\fIarg arg ...\fR?
 .BE
-
 .SH INTRODUCTION
 .PP
-Strings in Tcl are encoded using 16-bit Unicode characters. Different
-operating system interfaces or applications may generate strings in
-other encodings such as Shift-JIS. The \fBencoding\fR command helps
-to bridge the gap between Unicode and these other formats.
-
+Strings in Tcl are logically a sequence of 16-bit Unicode characters.
+These strings are represented in memory as a sequence of bytes that
+may be in one of several encodings: modified UTF\-8 (which uses 1 to 3
+bytes per character), 16-bit
+.QW Unicode
+(which uses 2 bytes per character, with an endianness that is
+dependent on the host architecture), and binary (which uses a single
+byte per character but only handles a restricted range of characters).
+Tcl does not guarantee to always use the same encoding for the same
+string.
+.PP
+Different operating system interfaces or applications may generate
+strings in other encodings such as Shift\-JIS. The \fBencoding\fR
+command helps to bridge the gap between Unicode and these other
+formats.
 .SH DESCRIPTION
 .PP
 Performs one of several encoding related operations, depending on
 \fIoption\fR. The legal \fIoption\fRs are:
 .TP
-\fBencoding convertfrom ?\fIencoding\fR? \fIdata\fR
+\fBencoding convertfrom\fR ?\fIencoding\fR? \fIdata\fR
+.
 Convert \fIdata\fR to Unicode from the specified \fIencoding\fR.
 The characters in \fIdata\fR are treated as binary data where the lower
 8-bits of each character is taken as a single byte. The resulting
@@ -35,22 +43,42 @@ sequence of bytes is treated as a string in the specified
 \fIencoding\fR. If \fIencoding\fR is not specified, the current
 system encoding is used.
 .TP
-\fBencoding convertto ?\fIencoding\fR? \fIstring\fR
+\fBencoding convertto\fR ?\fIencoding\fR? \fIstring\fR
+.
 Convert \fIstring\fR from Unicode to the specified \fIencoding\fR.
 The result is a sequence of bytes that represents the converted
 string. Each byte is stored in the lower 8-bits of a Unicode
-character. If \fIencoding\fR is not specified, the current
-system encoding is used.
+character (indeed, the resulting string is a binary string as far as
+Tcl is concerned, at least initially). If \fIencoding\fR is not
+specified, the current system encoding is used.
+.TP
+\fBencoding dirs\fR ?\fIdirectoryList\fR?
+.
+Tcl can load encoding data files from the file system that describe
+additional encodings for it to work with. This command sets the search
+path for \fB*.enc\fR encoding data files to the list of directories
+\fIdirectoryList\fR. If \fIdirectoryList\fR is omitted then the
+command returns the current list of directories that make up the
+search path. It is an error for \fIdirectoryList\fR to not be a valid
+list. If, when a search for an encoding data file is happening, an
+element in \fIdirectoryList\fR does not refer to a readable,
+searchable directory, that element is ignored.
 .TP
 \fBencoding names\fR
+.
 Returns a list containing the names of all of the encodings that are
 currently available.
+The encodings
+.QW utf-8
+and
+.QW iso8859-1
+are guaranteed to be present in the list.
 .TP
 \fBencoding system\fR ?\fIencoding\fR?
+.
 Set the system encoding to \fIencoding\fR. If \fIencoding\fR is
 omitted then the command returns the current system encoding. The
 system encoding is used whenever Tcl passes strings to system calls.
-
 .SH EXAMPLE
 .PP
 It is common practice to write script files using a text editor that
@@ -59,21 +87,29 @@ characters as singe bytes and Japanese characters as two bytes. This
 makes it easy to embed literal strings that correspond to non-ASCII
 characters by simply typing the strings in place in the script.
 However, because the \fBsource\fR command always reads files using the
-ISO8859-1 encoding, Tcl will treat each byte in the file as a separate
-character that maps to the 00 page in Unicode. The
-resulting Tcl strings will not contain the expected Japanese
-characters. Instead, they will contain a sequence of Latin-1
-characters that correspond to the bytes of the original string. The
-\fBencoding\fR command can be used to convert this string to the
-expected Japanese Unicode characters. For example,
+current system encoding, Tcl will only source such files correctly
+when the encoding used to write the file is the same. This tends not
+to be true in an internationalized setting. For example, if such a
+file was sourced in North America (where the ISO8859\-1 is normally
+used), each byte in the file would be treated as a separate character
+that maps to the 00 page in Unicode. The resulting Tcl strings will
+not contain the expected Japanese characters. Instead, they will
+contain a sequence of Latin-1 characters that correspond to the bytes
+of the original string. The \fBencoding\fR command can be used to
+convert this string to the expected Japanese Unicode characters. For
+example,
+.PP
 .CS
- set s [encoding convertfrom euc-jp "\\xA4\\xCF"]
+set s [\fBencoding convertfrom\fR euc-jp "\exA4\exCF"]
 .CE
-would return the Unicode string "\\u306F", which is the Hiragana
-letter HA.
-
+.PP
+would return the Unicode string
+.QW "\eu306F" ,
+which is the Hiragana letter HA.
 .SH "SEE ALSO"
 Tcl_GetEncoding(3)
-
 .SH KEYWORDS
-encoding
+encoding, unicode
+.\" Local Variables:
+.\" mode: nroff
+.\" End:
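
The behaviour described in the EXAMPLE section can be checked outside Tcl. The following is a minimal sketch and is not part of the patch. It assumes that Python's `euc_jp` and `latin-1` codecs agree with Tcl's `euc-jp` and `iso8859-1` encodings for the two bytes used in the man page. It shows the misread the text describes (each byte becoming a separate page-00 character) and the recovery that `encoding convertfrom` is documented to perform (low 8 bits of each character taken as a byte, then decoded).

```python
# Illustration only: a Python stand-in for the Tcl example in the man page.
# Assumption: Python's "euc_jp" / "latin-1" codecs correspond to Tcl's
# "euc-jp" / "iso8859-1" encodings for these particular bytes.

raw = b"\xA4\xCF"  # the two bytes from the man page example

# Reading the bytes as ISO8859-1 gives one character per byte (Unicode page 00).
misread = raw.decode("latin-1")

# Decoding the same bytes as EUC-JP gives the single intended character.
decoded = raw.decode("euc_jp")

# Rough analogue of `encoding convertfrom euc-jp $s` applied to the misread
# string: take the low 8 bits of each character as a byte, then decode.
recovered = misread.encode("latin-1").decode("euc_jp")

print([hex(ord(c)) for c in misread])
print(hex(ord(decoded)), decoded == recovered)
```

If the codec assumption holds, `decoded` is the single character the man page names as the Hiragana letter HA, and `recovered` equals it.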
