Added long comment explaining history and work in progress making bytearray

interfaces usable.
author: dgp <dgp@users.sourceforge.net> 2016-12-02 18:18:59 (GMT)
committer: dgp <dgp@users.sourceforge.net> 2016-12-02 18:18:59 (GMT)
commit: c9bee15d7360f859a1916da19733e2595caad00b (patch)
tree: 8cf04c1638aebb10089739405b83a22842184c85 /generic
parent: 1782f8677925d45e83402a7f1fac4f4b0e1e367b (diff)
download: tcl-c9bee15d7360f859a1916da19733e2595caad00b.zip
tcl-c9bee15d7360f859a1916da19733e2595caad00b.tar.gz
tcl-c9bee15d7360f859a1916da19733e2595caad00b.tar.bz2
1 files changed, 87 insertions, 22 deletions
diff --git a/generic/tclBinary.c b/generic/tclBinary.c
index cb4534d..4d063b2 100644
--- a/generic/tclBinary.c
+++ b/generic/tclBinary.c
@@ -155,28 +155,93 @@ static const EnsembleImplMap decodeMap[] = {
 };
 
 /*
- * The following object type represents an array of bytes. An array of bytes
- * is not equivalent to an internationalized string. Conceptually, a string is
- * an array of 16-bit quantities organized as a sequence of properly formed
- * UTF-8 characters, while a ByteArray is an array of 8-bit quantities.
- * Accessor functions are provided to convert a ByteArray to a String or a
- * String to a ByteArray. Two or more consecutive bytes in an array of bytes
- * may look like a single UTF-8 character if the array is casually treated as
- * a string. But obtaining the String from a ByteArray is guaranteed to
- * produced properly formed UTF-8 sequences so that there is a one-to-one map
- * between bytes and characters.
- *
- * Converting a ByteArray to a String proceeds by casting each byte in the
- * array to a 16-bit quantity, treating that number as a Unicode character,
- * and storing the UTF-8 version of that Unicode character in the String. For
- * ByteArrays consisting entirely of values 1..127, the corresponding String
- * representation is the same as the ByteArray representation.
- *
- * Converting a String to a ByteArray proceeds by getting the Unicode
- * representation of each character in the String, casting it to a byte by
- * truncating the upper 8 bits, and then storing the byte in the ByteArray.
- * Converting from ByteArray to String and back to ByteArray is not lossy, but
- * converting an arbitrary String to a ByteArray may be.
+ * The following object types represent an array of bytes. The intent is
+ * to allow arbitrary binary data to pass through Tcl as a Tcl value
+ * without loss or damage. Such values are useful for things like
+ * encoded strings or Tk images to name just two.
+ *
+ * It's strange to have two Tcl_ObjTypes in place for this task when
+ * one would do, so a bit of detail and history how we got to this point
+ * and where we might go from here.
+ *
+ * A bytearray is an ordered sequence of bytes. Each byte is an integer
+ * value in the range [0-255].  To be a Tcl value type, we need a way to
+ * encode each value in the value set as a Tcl string.  The simplest
+ * encoding is to represent each byte value as the same codepoint value.
+ * A bytearray of N bytes is encoded into a Tcl string of N characters
+ * where the codepoint of each character is the value of corresponding byte.
+ * This approach creates a one-to-one map between all bytearray values
+ * and a subset of Tcl string values.
+ * 
+ * When converting a Tcl string value to the bytearray internal rep, the
+ * question arises what to do with strings outside that subset?  That is,
+ * those Tcl strings containing at least one codepoint greater than 255?
+ * The obviously correct answer is to raise an error!  That string value
+ * does not represent any valid bytearray value. Full Stop.  The
+ * setFromAnyProc signature has a completion code return value for just
+ * this reason, to reject invalid inputs.
+ * 
+ * Unfortunately this was not the path taken by the authors of the
+ * original tclByteArrayType.  They chose to accept all Tcl string values
+ * as acceptable string encodings of the bytearray values that result
+ * from masking away the high bits of any codepoint value at all. This
+ * meant that every bytearray value had multiple accepted string
+ * representations.
+ *
+ * The implications of this choice are truly ugly.  When a Tcl value has
+ * a string representation, we are required to accept that as the true
+ * value.  Bytearray values that possess a string representation cannot
+ * be processed as bytearrays because we cannot know which true value
+ * that bytearray represents.  The consequence is that we drag around
+ * an internal rep that we cannot make any use of.  This painful price
+ * is extracted at any point after a string rep happens to be generated
+ * for the value.  This happens even when the troublesome codepoints
+ * outside the byte range never show up.  This happens rather routinely
+ * in normal Tcl operations unless we burden the script writer with the
+ * cognitive burden of avoiding it.  The price is also paid by callers
+ * of the C interface.  The routine
+ *
+ *	unsigned char *Tcl_GetByteArrayFromObj(objPtr, lenPtr)
+ *
+ * has a guarantee to always return a non-NULL value, but that value
+ * points to a byte sequence that cannot be used by the caller to  
+ * process the Tcl value absent some sideband testing that objPtr
+ * is "pure".  Tcl offers no public interface to perform this test,
+ * so callers either break encapsulation or are unavoidably buggy.  Tcl
+ * has defined a public interface that cannot be used correctly. The
+ * Tcl source code itself suffers the same problem, and has been buggy,
+ * but progressively less so as more and more portions of the code have
+ * been retrofitted with the required "purity testing".  The set of values
+ * able to pass the purity test can be increased via the introduction of
+ * a "canonical" flag marker, but the only way the broken interface itself
+ * can be discarded is to start over and define the Tcl_ObjType properly.
+ * Bytearrays should simply be usable as bytearrays without a kabuki
+ * dance of testing.
+ *
+ * The Tcl_ObjType "properByteArrayType" is (nearly) a correct 
+ * implementation of bytearrays.  Any Tcl value with the type
+ * properByteArrayType can have its bytearray value fetched and
+ * used with confidence that acting on that value is equivalent to
+ * acting on the true Tcl string value.  This still implies a side
+ * testing burden -- past mistakes will not let us avoid that
+ * immediately, but it is at least a conventional test of type, and
+ * can be implemented entirely by examining the objPtr fields, with
+ * no need to query the intrep, as a canonical flag would require.
+ *
+ * Until Tcl_GetByteArrayFromObj() and Tcl_SetByteArrayLength() can
+ * be revised to admit the possibility of returning NULL when the true
+ * value is not a valid bytearray, we need a mechanism to retain
+ * compatibility with the deployed callers of the broken interface.
+ * That's what the retained "tclByteArrayType" provides.  In those
+ * unusual circumstances where we convert an invalid bytearray value
+ * to a bytearray type, it is to this legacy type.  Essentially any
+ * time this legacy type gets used, it's a signal of a bug being ignored.
+ * A TIP should be drafted to remove this connection to the broken past
+ * so that Tcl 9 will no longer have any trace of it.  Prescribing a
+ * migration path will be the key element of that work.  The internal
+ * changes now in place are the limit of what can be done short of
+ * interface repair.  They provide a great expansion of the histories
+ * over which bytearray values can be useful in the meanwhile.
  */
 
 static const Tcl_ObjType properByteArrayType = {
author	dgp <dgp@users.sourceforge.net>	2016-12-02 18:18:59 (GMT)
committer	dgp <dgp@users.sourceforge.net>	2016-12-02 18:18:59 (GMT)
commit	c9bee15d7360f859a1916da19733e2595caad00b (patch)
tree	8cf04c1638aebb10089739405b83a22842184c85 /generic
parent	1782f8677925d45e83402a7f1fac4f4b0e1e367b (diff)
download	tcl-c9bee15d7360f859a1916da19733e2595caad00b.zip tcl-c9bee15d7360f859a1916da19733e2595caad00b.tar.gz tcl-c9bee15d7360f859a1916da19733e2595caad00b.tar.bz2