author jan.nijtmans <nijtmans@users.sourceforge.net> 2018-06-27 19:09:01 (GMT)
committer jan.nijtmans <nijtmans@users.sourceforge.net> 2018-06-27 19:09:01 (GMT)
commit 66ec3cd3a668d06b78e321578eb5f3fa1cda5031
tree a552d7e0fe3970cf84238655cf00cdff26f88208 /doc/Utf.3
parent c44181ecafaa160b94728593527ebca0260dc51f
parent 7cb7a44074f18108b2cedbf4496758442149d9d5
merge trunk
Diffstat (limited to 'doc/Utf.3')
-rw-r--r-- doc/Utf.3 24
1 files changed, 13 insertions, 11 deletions
diff --git a/doc/Utf.3 b/doc/Utf.3
index 993974c..d6f892d 100644
--- a/doc/Utf.3
+++ b/doc/Utf.3
@@ -96,10 +96,9 @@ A null-terminated Unicode string.
A null-terminated Unicode string.
.AP size_t length in
The length of the UTF-8 string in bytes (not UTF-8 characters). If
--1, all bytes up to the first null byte are used.
+(size_t)-1, all bytes up to the first null byte are used.
.AP size_t uniLength in
-The length of the Unicode string in characters. Must be greater than or
-equal to 0.
+The length of the Unicode string in characters.
.AP "Tcl_DString" *dsPtr in/out
A pointer to a previously initialized \fBTcl_DString\fR.
.AP "const char" *start in
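Note on the hunk above: because \fIlength\fR is now a size_t, which is unsigned, a check such as "length < 0" can never be true, so the sentinel has to be written as (size_t)-1 and tested by equality. A minimal standalone sketch of that convention follows. The helper name effective_length is hypothetical and is not part of the Tcl API.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not a Tcl API: resolve the "use all bytes up to
 * the first null byte" sentinel described for the length argument. */
static size_t effective_length(const char *src, size_t length)
{
    /* (size_t)-1 converts to the largest size_t value. An unsigned
     * length never compares below zero, so test for equality. */
    if (length == (size_t)-1) {
        return strlen(src);
    }
    return length;
}
```

With this sketch, effective_length("hello", (size_t)-1) yields 5, while an explicit length such as 3 is passed through unchanged.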
@@ -119,8 +118,8 @@ case-insensitive (1).
.SH DESCRIPTION
.PP
-These routines convert between UTF-8 strings and Tcl_UniChars. A
-Tcl_UniChar is a Unicode character represented as an unsigned, fixed-size
+These routines convert between UTF-8 strings and Unicode characters. A
+Unicode character is represented as an unsigned, fixed-size
quantity. A UTF-8 character is a Unicode character represented as
a varying-length sequence of up to \fBTCL_UTF_MAX\fR bytes. A multibyte UTF-8
sequence consists of a lead byte followed by some number of trail bytes.
@@ -128,9 +127,12 @@ sequence consists of a lead byte followed by some number of trail bytes.
\fBTCL_UTF_MAX\fR is the maximum number of bytes that it takes to
represent one Unicode character in the UTF-8 representation.
.PP
-\fBTcl_UniCharToUtf\fR stores the Tcl_UniChar \fIch\fR as a UTF-8 string
+\fBTcl_UniCharToUtf\fR stores the character \fIch\fR as a UTF-8 string
starting at \fIbuf\fR. The return value is the number of bytes stored
-in \fIbuf\fR.
+in \fIbuf\fR. If \fIch\fR is an upper surrogate (range U+D800 - U+DBFF), then
+the return value will be 0 and nothing will be stored. If you still
+want to produce UTF-8 output for it (even though it is an illegal
+code point on its own), just call \fBTcl_UniCharToUtf\fR again using \fIch\fR = -1.
.PP
\fBTcl_UtfToUniChar\fR reads one UTF-8 character starting at \fIsrc\fR
and stores it as a Tcl_UniChar in \fI*chPtr\fR. The return value is the
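Note on the hunk above: a caller that wants to know in advance whether the 0-byte return can occur can test for the upper-surrogate range itself. The sketch below is standalone and does not call or reproduce Tcl's implementation. The helper names is_upper_surrogate and utf8_byte_count are hypothetical, and the byte counts follow the generic UTF-8 bit layout rather than any Tcl-specific behaviour.

```c
#include <stddef.h>

/* Hypothetical helpers for illustration; they are not Tcl APIs. */

/* True when ch lies in the upper-surrogate range U+D800 - U+DBFF. */
static int is_upper_surrogate(unsigned long ch)
{
    return ch >= 0xD800UL && ch <= 0xDBFFUL;
}

/* Number of bytes the UTF-8 bit layout needs for the value ch,
 * or 0 when ch lies beyond U+10FFFF. */
static size_t utf8_byte_count(unsigned long ch)
{
    if (ch <= 0x7FUL) return 1;
    if (ch <= 0x7FFUL) return 2;
    if (ch <= 0xFFFFUL) return 3;
    if (ch <= 0x10FFFFUL) return 4;
    return 0;
}
```

A caller could check is_upper_surrogate(ch) before encoding and decide whether to wait for the matching lower surrogate or to force output as the text above describes.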
@@ -201,7 +203,7 @@ of \fIlength\fR bytes is long enough to be decoded by
\fBTcl_UtfToUniChar\fR, or 0 otherwise. This function does not guarantee
that the UTF-8 string is properly formed. This routine is used by
procedures that are operating on a byte at a time and need to know if a
-full Tcl_UniChar has been seen.
+full Unicode character has been seen.
.PP
\fBTcl_NumUtfChars\fR corresponds to \fBstrlen\fR for UTF-8 strings. It
returns the number of Tcl_UniChars that are represented by the UTF-8 string
@@ -209,12 +211,12 @@ returns the number of Tcl_UniChars that are represented by the UTF-8 string
length is negative, all bytes up to the first null byte are used.
.PP
\fBTcl_UtfFindFirst\fR corresponds to \fBstrchr\fR for UTF-8 strings. It
-returns a pointer to the first occurrence of the Tcl_UniChar \fIch\fR
+returns a pointer to the first occurrence of the Unicode character \fIch\fR
in the null-terminated UTF-8 string \fIsrc\fR. The null terminator is
considered part of the UTF-8 string.
.PP
\fBTcl_UtfFindLast\fR corresponds to \fBstrrchr\fR for UTF-8 strings. It
-returns a pointer to the last occurrence of the Tcl_UniChar \fIch\fR
+returns a pointer to the last occurrence of the Unicode character \fIch\fR
in the null-terminated UTF-8 string \fIsrc\fR. The null terminator is
considered part of the UTF-8 string.
.PP
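Note on the hunk above: the statement that the null terminator is considered part of the string mirrors the C library functions these routines are compared to. The sketch below shows that behaviour for plain strchr and strrchr on single bytes only. It does not call the Tcl routines, and the helper names first_offset and last_offset are hypothetical.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helpers, not Tcl APIs. Each returns the byte offset of
 * the match in s, or (size_t)-1 when the byte c does not occur. */
static size_t first_offset(const char *s, int c)
{
    const char *p = strchr(s, c);
    return p ? (size_t)(p - s) : (size_t)-1;
}

static size_t last_offset(const char *s, int c)
{
    const char *p = strrchr(s, c);
    return p ? (size_t)(p - s) : (size_t)-1;
}
```

Searching for '\0' succeeds and returns the offset of the terminator, which is the property the man page text carries over to the UTF-8 versions.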
@@ -241,7 +243,7 @@ characters.
\fBTcl_UtfAtIndex\fR returns a pointer to the specified character (not
byte) \fIindex\fR in the UTF-8 string \fIsrc\fR. The source string must
contain at least \fIindex\fR characters. This is equivalent to calling
-\fBTcl_UtfNext\fR \fIindex\fR times. If \fIindex\fR is -1,
+\fBTcl_UtfNext\fR \fIindex\fR times. If \fIindex\fR is (size_t)-1,
the return pointer points to the first character in the source string.
.PP
\fBTcl_UtfBackslash\fR is a utility procedure used by several of the Tcl