summaryrefslogtreecommitdiffstats
path: root/doc
diff options
context:
space:
mode:
authormax <max@tclers.tk>2014-02-26 12:44:12 (GMT)
committermax <max@tclers.tk>2014-02-26 12:44:12 (GMT)
commit7b8b2d52e5298c10a227114f17db436bacceb56c (patch)
tree153a518a388a7c6e6f9ec74fdb60a20a43484568 /doc
parentb58c67dddc5793d12e85d5b0066a4660d2b08671 (diff)
parent259729fa361e6d184ef91be067a93309e14cd998 (diff)
downloadtcl-7b8b2d52e5298c10a227114f17db436bacceb56c.zip
tcl-7b8b2d52e5298c10a227114f17db436bacceb56c.tar.gz
tcl-7b8b2d52e5298c10a227114f17db436bacceb56c.tar.bz2
Merge trunk
Diffstat (limited to 'doc')
-rw-r--r--doc/Tcl.n39
-rw-r--r--doc/encoding.n34
-rw-r--r--doc/string.n28
-rw-r--r--doc/tclvars.n2
4 files changed, 74 insertions, 29 deletions
diff --git a/doc/Tcl.n b/doc/Tcl.n
index 8b17f93..c7fa9f6 100644
--- a/doc/Tcl.n
+++ b/doc/Tcl.n
@@ -108,8 +108,8 @@ Variable substitution may take any of the following forms:
\fIName\fR is the name of a scalar variable; the name is a sequence
of one or more characters that are a letter, digit, underscore,
or namespace separators (two or more colons).
-Letters and digits are \fIonly\fR the standard ASCII ones (\fB0\fR\-\fB9\fR,
-\fBA\fR\-\fBZ\fR and \fBa\fR\-\fBz\fR).
+Letters and digits are \fIonly\fR the standard ASCII ones (\fB0\fR\(en\fB9\fR,
+\fBA\fR\(en\fBZ\fR and \fBa\fR\(en\fBz\fR).
.TP 15
\fB$\fIname\fB(\fIindex\fB)\fR
.
@@ -117,8 +117,8 @@ Letters and digits are \fIonly\fR the standard ASCII ones (\fB0\fR\-\fB9\fR,
the name of an element within that array.
\fIName\fR must contain only letters, digits, underscores, and
namespace separators, and may be an empty string.
-Letters and digits are \fIonly\fR the standard ASCII ones (\fB0\fR\-\fB9\fR,
-\fBA\fR\-\fBZ\fR and \fBa\fR\-\fBz\fR).
+Letters and digits are \fIonly\fR the standard ASCII ones (\fB0\fR\(en\fB9\fR,
+\fBA\fR\(en\fBZ\fR and \fBa\fR\(en\fBz\fR).
Command substitutions, variable substitutions, and backslash
substitutions are performed on the characters of \fIindex\fR.
.TP 15
@@ -158,25 +158,25 @@ handled specially, along with the value that replaces each sequence.
.RS
.TP 7
\e\fBa\fR
-Audible alert (bell) (0x7).
+Audible alert (bell) (Unicode U+000007).
.TP 7
\e\fBb\fR
-Backspace (0x8).
+Backspace (Unicode U+000008).
.TP 7
\e\fBf\fR
-Form feed (0xc).
+Form feed (Unicode U+00000C).
.TP 7
\e\fBn\fR
-Newline (0xa).
+Newline (Unicode U+00000A).
.TP 7
\e\fBr\fR
-Carriage-return (0xd).
+Carriage-return (Unicode U+00000D).
.TP 7
\e\fBt\fR
-Tab (0x9).
+Tab (Unicode U+000009).
.TP 7
\e\fBv\fR
-Vertical tab (0xb).
+Vertical tab (Unicode U+00000B).
.TP 7
\e\fB<newline>\fIwhiteSpace\fR
.
@@ -194,8 +194,9 @@ Backslash
\e\fIooo\fR
.
The digits \fIooo\fR (one, two, or three of them) give a eight-bit octal
-value for the Unicode character that will be inserted, in the range \fI000\fR
-- \fI377\fR. The parser will stop just before this range overflows, or when
+value for the Unicode character that will be inserted, in the range
+\fI000\fR\(en\fI377\fR (i.e., the range U+000000\(enU+0000FF).
+The parser will stop just before this range overflows, or when
the maximum of three digits is reached. The upper bits of the Unicode
character will be 0.
.TP 7
@@ -203,23 +204,27 @@ character will be 0.
.
The hexadecimal digits \fIhh\fR (one or two of them) give an eight-bit
hexadecimal value for the Unicode character that will be inserted. The upper
-bits of the Unicode character will be 0.
+bits of the Unicode character will be 0 (i.e., the character will be in the
+range U+000000\(enU+0000FF).
.TP 7
\e\fBu\fIhhhh\fR
.
The hexadecimal digits \fIhhhh\fR (one, two, three, or four of them) give a
sixteen-bit hexadecimal value for the Unicode character that will be
-inserted. The upper bits of the Unicode character will be 0.
+inserted. The upper bits of the Unicode character will be 0 (i.e., the
+character will be in the range U+000000\(enU+00FFFF).
.TP 7
\e\fBU\fIhhhhhhhh\fR
.
The hexadecimal digits \fIhhhhhhhh\fR (one up to eight of them) give a
twenty-one-bit hexadecimal value for the Unicode character that will be
-inserted, in the range U+0000..U+10FFFF. The parser will stop just
+inserted, in the range U+000000\(enU+10FFFF. The parser will stop just
before this range overflows, or when the maximum of eight digits
is reached. The upper bits of the Unicode character will be 0.
+.RS
.PP
-The range U+010000..U+10FFFD is reserved for the future.
+The range U+010000\(enU+10FFFD is reserved for the future.
+.RE
.PP
Backslash substitution is not performed on words enclosed in braces,
except for backslash-newline as described above.
diff --git a/doc/encoding.n b/doc/encoding.n
index be1dc3f..5782199 100644
--- a/doc/encoding.n
+++ b/doc/encoding.n
@@ -14,10 +14,21 @@ encoding \- Manipulate encodings
.BE
.SH INTRODUCTION
.PP
-Strings in Tcl are encoded using 16-bit Unicode characters. Different
-operating system interfaces or applications may generate strings in
-other encodings such as Shift-JIS. The \fBencoding\fR command helps
-to bridge the gap between Unicode and these other formats.
+Strings in Tcl are logically a sequence of 16-bit Unicode characters.
+These strings are represented in memory as a sequence of bytes that
+may be in one of several encodings: modified UTF\-8 (which uses 1 to 3
+bytes per character), 16-bit
+.QW Unicode
+(which uses 2 bytes per character, with an endianness that is
+dependent on the host architecture), and binary (which uses a single
+byte per character but only handles a restricted range of characters).
+Tcl does not guarantee to always use the same encoding for the same
+string.
+.PP
+Different operating system interfaces or applications may generate
+strings in other encodings such as Shift\-JIS. The \fBencoding\fR
+command helps to bridge the gap between Unicode and these other
+formats.
.SH DESCRIPTION
.PP
Performs one of several encoding related operations, depending on
@@ -37,8 +48,9 @@ system encoding is used.
Convert \fIstring\fR from Unicode to the specified \fIencoding\fR.
The result is a sequence of bytes that represents the converted
string. Each byte is stored in the lower 8-bits of a Unicode
-character. If \fIencoding\fR is not specified, the current
-system encoding is used.
+character (indeed, the resulting string is a binary string as far as
+Tcl is concerned, at least initially). If \fIencoding\fR is not
+specified, the current system encoding is used.
.TP
\fBencoding dirs\fR ?\fIdirectoryList\fR?
.
@@ -56,6 +68,11 @@ searchable directory, that element is ignored.
.
Returns a list containing the names of all of the encodings that are
currently available.
+The encodings
+.QW utf-8
+and
+.QW iso8859-1
+are guaranteed to be present in the list.
.TP
\fBencoding system\fR ?\fIencoding\fR?
.
@@ -73,7 +90,7 @@ However, because the \fBsource\fR command always reads files using the
current system encoding, Tcl will only source such files correctly
when the encoding used to write the file is the same. This tends not
to be true in an internationalized setting. For example, if such a
-file was sourced in North America (where the ISO8859-1 is normally
+file was sourced in North America (where the ISO8859\-1 is normally
used), each byte in the file would be treated as a separate character
that maps to the 00 page in Unicode. The resulting Tcl strings will
not contain the expected Japanese characters. Instead, they will
@@ -93,3 +110,6 @@ which is the Hiragana letter HA.
Tcl_GetEncoding(3)
.SH KEYWORDS
encoding, unicode
+.\" Local Variables:
+.\" mode: nroff
+.\" End:
diff --git a/doc/string.n b/doc/string.n
index 76005fc..163abdd 100644
--- a/doc/string.n
+++ b/doc/string.n
@@ -343,10 +343,13 @@ misleading.
\fBstring bytelength \fIstring\fR
.
Returns a decimal string giving the number of bytes used to represent
-\fIstring\fR in memory. Because UTF\-8 uses one to three bytes to
-represent Unicode characters, the byte length will not be the same as
-the character length in general. The cases where a script cares about
-the byte length are rare.
+\fIstring\fR in memory when encoded as Tcl's internal modified UTF\-8;
+Tcl may use other encodings for \fIstring\fR as well, and does not
+guarantee to only use a single encoding for a particular \fIstring\fR.
+Because UTF\-8 uses a variable number of bytes to represent Unicode
+characters, the byte length will not be the same as the character
+length in general. The cases where a script cares about the byte
+length are rare.
.RS
.PP
In almost all cases, you should use the
@@ -354,10 +357,27 @@ In almost all cases, you should use the
Tcl byte array value). Refer to the \fBTcl_NumUtfChars\fR manual
entry for more details on the UTF\-8 representation.
.PP
+Formally, the \fBstring bytelength\fR operation returns the content of
+the \fIlength\fR field of the \fBTcl_Obj\fR structure, after calling
+\fBTcl_GetString\fR to ensure that the \fIbytes\fR field is populated.
+This is highly unlikely to be useful to Tcl scripts, as Tcl's internal
+encoding is not strict UTF\-8, but rather a modified CESU\-8 with a
+denormalized NUL (identical to that used in a number of places by
+Java's serialization mechanism) to enable basic processing with
+non-Unicode-aware C functions. As this representation should only
+ever be used by Tcl's implementation, the number of bytes used to
+store the representation is of very low value (except to C extension
+code, which has direct access for the purpose of memory management,
+etc.)
+.PP
\fICompatibility note:\fR it is likely that this subcommand will be
withdrawn in a future version of Tcl. It is better to use the
\fBencoding convertto\fR command to convert a string to a known
encoding and then apply \fBstring length\fR to that.
+.PP
+.CS
+\fBstring length\fR [encoding convertto utf-8 $theString]
+.CE
.RE
.TP
\fBstring wordend \fIstring charIndex\fR
diff --git a/doc/tclvars.n b/doc/tclvars.n
index 2fec222..9d7a4ce 100644
--- a/doc/tclvars.n
+++ b/doc/tclvars.n
@@ -10,7 +10,7 @@
.BS
'\" Note: do not modify the .SH NAME line immediately below!
.SH NAME
-argc, argv, argv0, auto_path, env, errorCode, errorInfo, tcl_interactive, tcl_library, tcl_nonwordchars, tcl_patchLevel, tcl_pkgPath, tcl_platform, tcl_precision, tcl_rcFileName, tcl_traceCompile, tcl_traceEval, tcl_wordchars, tcl_version \- Variables used by Tcl
+argc, argv, argv0, auto_path, env, errorCode, errorInfo, tcl_interactive, tcl_library, tcl_nonwordchars, tcl_patchLevel, tcl_pkgPath, tcl_platform, tcl_precision, tcl_rcFileName, tcl_traceCompile, tcl_traceExec, tcl_wordchars, tcl_version \- Variables used by Tcl
.BE
.SH DESCRIPTION
.PP