Merge trunk

author: max <max@tclers.tk> 2014-02-26 12:44:12 (GMT)
committer: max <max@tclers.tk> 2014-02-26 12:44:12 (GMT)
commit: 7b8b2d52e5298c10a227114f17db436bacceb56c (patch)
tree: 153a518a388a7c6e6f9ec74fdb60a20a43484568 /doc/string.n
parent: b58c67dddc5793d12e85d5b0066a4660d2b08671 (diff)
parent: 259729fa361e6d184ef91be067a93309e14cd998 (diff)
1 files changed, 24 insertions, 4 deletions
diff --git a/doc/string.n b/doc/string.n
index 76005fc..163abdd 100644
--- a/doc/string.n
+++ b/doc/string.n
@@ -343,10 +343,13 @@ misleading.
 \fBstring bytelength \fIstring\fR
 .
 Returns a decimal string giving the number of bytes used to represent
-\fIstring\fR in memory.  Because UTF\-8 uses one to three bytes to
-represent Unicode characters, the byte length will not be the same as
-the character length in general.  The cases where a script cares about
-the byte length are rare.
+\fIstring\fR in memory when encoded as Tcl's internal modified UTF\-8;
+Tcl may use other encodings for \fIstring\fR as well, and does not
+guarantee to only use a single encoding for a particular \fIstring\fR.
+Because UTF\-8 uses a variable number of bytes to represent Unicode
+characters, the byte length will not be the same as the character
+length in general.  The cases where a script cares about the byte
+length are rare.
 .RS
 .PP
 In almost all cases, you should use the
@@ -354,10 +357,27 @@ In almost all cases, you should use the
 Tcl byte array value).  Refer to the \fBTcl_NumUtfChars\fR manual
 entry for more details on the UTF\-8 representation.
 .PP
+Formally, the \fBstring bytelength\fR operation returns the content of
+the \fIlength\fR field of the \fBTcl_Obj\fR structure, after calling
+\fBTcl_GetString\fR to ensure that the \fIbytes\fR field is populated.
+This is highly unlikely to be useful to Tcl scripts, as Tcl's internal
+encoding is not strict UTF\-8, but rather a modified CESU\-8 with a
+denormalized NUL (identical to that used in a number of places by
+Java's serialization mechanism) to enable basic processing with
+non-Unicode-aware C functions.  As this representation should only
+ever be used by Tcl's implementation, the number of bytes used to
+store the representation is of very low value (except to C extension
+code, which has direct access for the purpose of memory management,
+etc.)
+.PP
 \fICompatibility note:\fR it is likely that this subcommand will be
 withdrawn in a future version of Tcl. It is better to use the
 \fBencoding convertto\fR command to convert a string to a known
 encoding and then apply \fBstring length\fR to that.
+.PP
+.CS
+\fBstring length\fR [encoding convertto utf-8 $theString]
+.CE
 .RE
 .TP
 \fBstring wordend \fIstring charIndex\fR