Marc-Andre Lemburg: Python Unicode integration proposal, version 1.2.

author: Guido van Rossum <guido@python.org> 2000-03-10 23:14:11 (GMT)
committer: Guido van Rossum <guido@python.org> 2000-03-10 23:14:11 (GMT)
commit: 9ed0d1ef18321f8939cd899276bba27cb61e5c3a (patch)
tree: 1d26cde56f6ff67d6c126d7628e08712dcd9d8c6 /Misc
parent: e141fd84e96abf8eb509e7c4d5503fb5cd972758 (diff)
download: cpython-9ed0d1ef18321f8939cd899276bba27cb61e5c3a.zip
cpython-9ed0d1ef18321f8939cd899276bba27cb61e5c3a.tar.gz
cpython-9ed0d1ef18321f8939cd899276bba27cb61e5c3a.tar.bz2
1 files changed, 885 insertions, 0 deletions
diff --git a/Misc/unicode.txt b/Misc/unicode.txt
new file mode 100644
index 0000000..b31beef
--- /dev/null
+++ b/Misc/unicode.txt
@@ -0,0 +1,885 @@
+=============================================================================
+ Python Unicode Integration                            Proposal Version: 1.2
+-----------------------------------------------------------------------------
+
+
+Introduction:
+-------------
+
+The idea of this proposal is to add native Unicode 3.0 support to
+Python in a way that makes use of Unicode strings as simple as
+possible without introducing too many pitfalls along the way.
+
+Since this goal is not easy to achieve -- strings being one of the
+most fundamental objects in Python -- we expect this proposal to
+undergo some significant refinements.
+
+Note that the current version of this proposal is still a bit unsorted
+due to the many different aspects of the Unicode-Python integration.
+
+The latest version of this document is always available at:
+
+        http://starship.skyport.net/~lemburg/unicode-proposal.txt
+
+Older versions are available as:
+
+        http://starship.skyport.net/~lemburg/unicode-proposal-X.X.txt
+
+
+Conventions:
+------------
+
+· In examples we use u = Unicode object and s = Python string
+
+· 'XXX' markings indicate points of discussion (PODs)
+
+
+General Remarks:
+----------------
+
+· Unicode encoding names should be lower case on output and
+  case-insensitive on input (they will be converted to lower case
+  by all APIs taking an encoding name as input).
+
+  Encoding names should follow the name conventions as used by the
+  Unicode Consortium: spaces are converted to hyphens, e.g. 'utf 16' is
+  written as 'utf-16'.
+
+  Codec modules should use the same names, but with hyphens converted
+  to underscores, e.g. utf_8, utf_16, iso_8859_1.
+
+· The <default encoding> should be the widely used 'utf-8' format. This
+  is very close to the standard 7-bit ASCII format and thus resembles the
+  standard used in programming nowadays in most aspects.
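+
+As an illustration only (not part of the proposal), the naming rules
+above can be sketched in a few lines. The sketch assumes a present-day
+Python 3 interpreter; the helper names normalize_encoding_name() and
+codec_module_name() are hypothetical.
+
```python
def normalize_encoding_name(name):
    """Lower-case the name and turn runs of spaces into single hyphens."""
    return "-".join(name.lower().split())


def codec_module_name(name):
    """Derive the codec module name: hyphens become underscores."""
    return normalize_encoding_name(name).replace("-", "_")


print(normalize_encoding_name("UTF 16"))   # -> utf-16
print(codec_module_name("ISO-8859-1"))     # -> iso_8859_1
```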
+
+
+Unicode Constructors:
+---------------------
+
+Python should provide a built-in constructor for Unicode strings which
+is available through __builtins__:
+
+  u = unicode(encoded_string[,encoding=<default encoding>][,errors="strict"])
+
+  u = u'<unicode-escape encoded Python string>'
+
+  u = ur'<raw-unicode-escape encoded Python string>'
+
+With the 'unicode-escape' encoding being defined as:
+
+· all non-escape characters represent themselves as Unicode ordinal
+  (e.g. 'a' -> U+0061).
+
+· all existing defined Python escape sequences are interpreted as
+  Unicode ordinals; note that \xXXXX can represent all Unicode
+  ordinals, and \OOO (octal) can represent Unicode ordinals up to U+01FF.
+
+· a new escape sequence, \uXXXX, represents U+XXXX; it is a syntax
+  error to have fewer than 4 digits after \u.
+
+For an explanation of possible values for errors see the Codec section
+below.
+
+Examples:
+
+u'abc'          -> U+0061 U+0062 U+0063
+u'\u1234'       -> U+1234
+u'abc\u1234\n'  -> U+0061 U+0062 U+0063 U+1234 U+000a
+
+The 'raw-unicode-escape' encoding is defined as follows:
+
+· \uXXXX sequences represent the U+XXXX Unicode character if and
+  only if the number of leading backslashes is odd
+
+· all other characters represent themselves as Unicode ordinal
+  (e.g. 'b' -> U+0062)
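+
+The difference between the two escape encodings can be tried out with
+the following sketch. It is an illustration under assumptions: it uses
+a present-day Python 3 interpreter, where the codecs are spelled
+'unicode_escape' and 'raw_unicode_escape' and decode from a bytes
+object rather than from a Python string.
+
```python
# In the bytes literals below, each "\\" stands for one backslash byte.
escaped = b"abc\\u1234\\n"

cooked = escaped.decode("unicode_escape")      # \u1234 and \n are interpreted
raw = escaped.decode("raw_unicode_escape")     # only \u1234 is interpreted

cooked_ordinals = ["U+%04X" % ord(ch) for ch in cooked]
raw_ordinals = ["U+%04X" % ord(ch) for ch in raw]
print(cooked_ordinals)
print(raw_ordinals)

# An even number of leading backslashes leaves \uXXXX undecoded in raw mode.
even = b"\\\\u1234".decode("raw_unicode_escape")
print(len(even))
```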
+
+
+Note that you should provide some hint to the encoding you used to
+write your programs as a pragma line in one of the first few comment
+lines of the source file (e.g. '# source file encoding: latin-1'). If
+you only use 7-bit ASCII then everything is fine and no such notice is
+needed, but if you include Latin-1 characters not defined in ASCII, it
+may well be worthwhile including a hint since people in other
+countries will want to be able to read your source strings too.
+
+
+Unicode Type Object:
+--------------------
+
+Unicode objects should have the type UnicodeType with type name
+'unicode', made available through the standard types module.
+
+
+Unicode Output:
+---------------
+
+Unicode objects have a method .encode([encoding=<default encoding>])
+which returns a Python string encoding the Unicode string using the
+given scheme (see Codecs).
+
+  print u := print u.encode()   # using the <default encoding>
+ 
+  str(u)  := u.encode()         # using the <default encoding>
+
+  repr(u) := "u%s" % repr(u.encode('unicode-escape'))
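+
+A small sketch of the .encode() method, assuming a present-day Python 3
+interpreter in which str plays the role of the Unicode object and
+bytes the role of the Python string. There, calling .encode() without
+arguments uses UTF-8, which coincides with the <default encoding>
+proposed here.
+
```python
u = "abc\u1234"

default_encoded = u.encode()                # no argument: UTF-8 in Python 3
explicit = u.encode("utf-8")
escaped = u.encode("unicode_escape")        # ASCII-only escaped form

print(default_encoded)
print(escaped)
```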
+
+Also see Internal Argument Parsing and Buffer Interface for details on
+how other APIs written in C will treat Unicode objects.
+
+
+Unicode Ordinals:
+-----------------
+
+Since Unicode 3.0 has a 32-bit ordinal character set, the implementation
+should provide 32-bit aware ordinal conversion APIs:
+
+  ord(u[:1]) (this is the standard ord() extended to work with Unicode
+              objects)
+        --> Unicode ordinal number (32-bit)
+
+  unichr(i) 
+        --> Unicode object for character i (provided it is 32-bit);
+            ValueError otherwise
+
+Both APIs should go into __builtins__ just like their string
+counterparts ord() and chr().
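+
+For illustration, the two conversions behave as follows in a
+present-day Python 3 interpreter (an assumption of this sketch), where
+the role of unichr() is played by chr(). Note that Python 3 limits the
+range to U+10FFFF instead of accepting every 32-bit value.
+
```python
code_point = ord("\u1234")      # character -> ordinal
char = chr(0x1234)              # ordinal -> one-character string

try:
    chr(0x110000)               # outside the supported range
except ValueError:
    out_of_range = True
else:
    out_of_range = False

print(hex(code_point), out_of_range)
```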
+
+Note that Unicode provides space for private encodings. Usage of these
+can cause different output representations on different machines. This
+problem is not a Python or Unicode problem, but a machine setup and
+maintenance one.
+
+
+Comparison & Hash Value:
+------------------------
+
+Unicode objects should compare equal to other objects after these
+other objects have been coerced to Unicode. For strings this means
+that they are interpreted as Unicode strings using the <default
+encoding>.
+
+For the same reason, Unicode objects should return the same hash value
+as their UTF-8 equivalent strings.
+
+
+Coercion:
+---------
+
+Using Python strings and Unicode objects to form new objects should
+always coerce to the more precise format, i.e. Unicode objects.
+
+  u + s := u + unicode(s)
+
+  s + u := unicode(s) + u
+
+All string methods should delegate the call to an equivalent Unicode
+object method call by converting all involved strings to Unicode and
+then applying the arguments to the Unicode method of the same name,
+e.g.
+
+  string.join((s,u),sep) := (s + sep) + u
+
+  sep.join((s,u)) := (s + sep) + u
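+
+The coercion rule can be modelled as follows. This is a sketch under
+assumptions: present-day Python 3 does not coerce implicitly, so the
+rule is spelled out with the hypothetical helpers to_unicode(),
+concat() and join(); bytes stands in for the Python string and str for
+the Unicode object.
+
```python
DEFAULT_ENCODING = "utf-8"   # stands in for the proposal's <default encoding>


def to_unicode(obj, encoding=DEFAULT_ENCODING):
    """Decode byte strings; pass Unicode text through unchanged."""
    if isinstance(obj, bytes):
        return obj.decode(encoding)
    return obj


def concat(left, right):
    """Model of u + s and s + u: both sides are coerced to Unicode."""
    return to_unicode(left) + to_unicode(right)


def join(sep, items):
    """Model of sep.join((s, u)): separator and items are coerced first."""
    return to_unicode(sep).join(to_unicode(item) for item in items)


print(concat("a\u1234", b"bc") == "a\u1234bc")
print(join(b", ", (b"s", "u")))
```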
+
+For a discussion of %-formatting w/r to Unicode objects, see
+Formatting Markers.
+
+
+Exceptions:
+-----------
+
+UnicodeError is defined in the exceptions module as a subclass of
+ValueError. It is available at the C level via PyExc_UnicodeError.
+All exceptions related to Unicode encoding/decoding should be
+subclasses of UnicodeError.
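+
+As a sketch of what this hierarchy means for callers (assuming a
+present-day Python 3 interpreter, where UnicodeError is a built-in
+name), a handler written for ValueError also catches decoding
+failures:
+
```python
try:
    b"\xff".decode("utf-8")          # 0xFF can never start a UTF-8 sequence
except ValueError as exc:            # the broad handler still catches it
    caught = exc

is_unicode_error = isinstance(caught, UnicodeError)
print(type(caught).__name__, is_unicode_error)
```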
+
+
+Codecs (Coder/Decoders) Lookup:
+-------------------------------
+
+A Codec (see Codec Interface Definition) search registry should be
+implemented by a module "codecs":
+
+  codecs.register(search_function)
+
+Search functions are expected to take one argument, the encoding name
+in all lower case letters, and return a tuple of functions (encoder,
+decoder, stream_reader, stream_writer) taking the following arguments:
+
+  encoder and decoder:
+        These must be functions or methods which have the same
+        interface as the .encode/.decode methods of Codec instances
+        (see Codec Interface). The functions/methods are expected to
+        work in a stateless mode.
+
+  stream_reader and stream_writer:
+        These need to be factory functions with the following
+        interface:
+
+                factory(stream,errors='strict')
+
+        The factory functions must return objects providing
+        the interfaces defined by StreamWriter/StreamReader resp.
+        (see Codec Interface). Stream codecs can maintain state.
+
+        Possible values for errors are defined in the Codec
+        section below.
+
+In case a search function cannot find a given encoding, it should
+return None.
+
+Aliasing support for encodings is left to the search functions
+to implement.
+
+The codecs module will maintain an encoding cache for performance
+reasons. Encodings are first looked up in the cache. If not found, the
+list of registered search functions is scanned. If no codecs tuple is
+found, a LookupError is raised. Otherwise, the codecs tuple is stored
+in the cache and returned to the caller.
+
+To query the Codec instance the following API should be used:
+
+  codecs.lookup(encoding)
+
+This will either return the found codecs tuple or raise a LookupError.
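+
+The registry can be exercised with the sketch below. It assumes the
+codecs module of a present-day Python 3 interpreter, where
+codecs.lookup() returns a tuple subclass whose four items are in the
+order given above. The alias 'demoalias' and the function
+demo_search() are hypothetical and exist only for this illustration.
+
```python
import codecs


def demo_search(name):
    """Search function: receives the lower-cased name, returns an entry or None."""
    if name == "demoalias":                      # hypothetical alias
        return codecs.lookup("iso-8859-1")       # delegate to a standard codec
    return None


codecs.register(demo_search)

entry = codecs.lookup("DemoAlias")               # name is lower-cased first
encoder, decoder, stream_reader, stream_writer = entry
data, consumed = encoder("caf\u00e9")

try:
    codecs.lookup("nosuchencodingxyz")
except LookupError:
    missing = True
else:
    missing = False

print(data, consumed, missing)
```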
+
+
+Standard Codecs:
+----------------
+
+Standard codecs should live inside an encodings/ package directory in the
+Standard Python Code Library. The __init__.py file of that directory should
+include a Codec Lookup compatible search function implementing a lazy module
+based codec lookup.
+
+Python should provide a few standard codecs for the most relevant
+encodings, e.g. 
+
+  'utf-8':              8-bit variable length encoding
+  'utf-16':             16-bit variable length encoding (little/big endian)
+  'utf-16-le':          utf-16 but explicitly little endian
+  'utf-16-be':          utf-16 but explicitly big endian
+  'ascii':              7-bit ASCII codepage
+  'iso-8859-1':         ISO 8859-1 (Latin 1) codepage
+  'unicode-escape':     See Unicode Constructors for a definition
+  'raw-unicode-escape': See Unicode Constructors for a definition
+  'native':             Dump of the Internal Format used by Python
+
+Common aliases should also be provided per default, e.g.  'latin-1'
+for 'iso-8859-1'.
+
+Note: 'utf-16' should be implemented by using and requiring byte order
+marks (BOM) for file input/output.
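+
+The difference between 'utf-16' and its explicit byte order variants
+can be observed with this sketch, which assumes the standard codecs of
+a present-day Python 3 interpreter:
+
```python
import codecs

text = "A"
with_bom = text.encode("utf-16")       # native byte order, BOM prepended
little = text.encode("utf-16-le")      # explicit order, no BOM
big = text.encode("utf-16-be")         # explicit order, no BOM

starts_with_bom = with_bom[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)
round_trip = with_bom.decode("utf-16")  # the BOM selects the byte order

# The alias and the canonical name resolve to the same codec.
same_codec = codecs.lookup("latin-1").name == codecs.lookup("iso-8859-1").name

print(little, big, starts_with_bom, same_codec)
```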
+
+All other encodings such as the CJK ones to support Asian scripts
+should be implemented in separate packages which do not get included
+in the core Python distribution and are not a part of this proposal.
+
+
+Codecs Interface Definition:
+----------------------------
+
+The following base classes should be defined in the module
+"codecs". They provide not only templates for use by encoding module
+implementors, but also define the interface which is expected by the
+Unicode implementation.
+
+Note that the Codec Interface defined here is well suited for a
+larger range of applications. The Unicode implementation expects
+Unicode objects on input for .encode() and .write() and character
+buffer compatible objects on input for .decode(). Output of .encode()
+and .read() should be a Python string and .decode() must return a
+Unicode object.
+
+First, we have the stateless encoders/decoders. These do not work in
+chunks as the stream codecs (see below) do, because all components are
+expected to be available in memory.
+
+class Codec:
+
+    """ Defines the interface for stateless encoders/decoders.
+
+        The .encode()/.decode() methods may implement different error
+        handling schemes by providing the errors argument. These
+        string values are defined:
+
+         'strict' - raise a ValueError (or a subclass)
+         'ignore' - ignore the character and continue with the next
+         'replace' - replace with a suitable replacement character;
+                    Python will use the official U+FFFD REPLACEMENT
+                    CHARACTER for the builtin Unicode codecs.
+
+    """
+    def encode(self,input,errors='strict'):
+
+        """ Encodes the object input and returns a tuple (output
+            object, length consumed).
+
+            errors defines the error handling to apply. It defaults to
+            'strict' handling.
+
+            The method may not store state in the Codec instance. Use
+            StreamWriter for codecs which have to keep state in order to
+            make encoding efficient.
+
+        """
+        ...
+
+    def decode(self,input,errors='strict'):
+
+        """ Decodes the object input and returns a tuple (output
+            object, length consumed).
+
+            input must be an object which provides the bf_getreadbuf
+            buffer slot. Python strings, buffer objects and memory
+            mapped files are examples of objects providing this slot.
+        
+            errors defines the error handling to apply. It defaults to
+            'strict' handling.
+
+            The method may not store state in the Codec instance. Use
+            StreamReader for codecs which have to keep state in order to
+            make decoding efficient.
+
+        """ 
+        ...
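+
+A minimal stateless codec following this interface might look like the
+sketch below. It is an illustration only: it assumes a present-day
+Python 3 interpreter and its codecs.Codec base class, and the class
+name AsciiCodec is hypothetical. Note that the built-in ASCII encoder
+substitutes '?' under 'replace', since U+FFFD itself cannot be
+represented in ASCII.
+
```python
import codecs


class AsciiCodec(codecs.Codec):
    """Hypothetical stateless codec, used only for illustration."""

    def encode(self, input, errors="strict"):
        # Return (output object, length consumed) as the interface requires.
        return input.encode("ascii", errors), len(input)

    def decode(self, input, errors="strict"):
        data = bytes(input)          # accept any buffer-compatible object
        return data.decode("ascii", errors), len(data)


codec = AsciiCodec()
encoded, consumed = codec.encode("abc")
replaced, _ = codec.encode("a\u1234c", "replace")
ignored, _ = codec.decode(b"a\xffc", "ignore")

try:
    codec.encode("\u1234")           # 'strict' is the default
except ValueError:
    strict_raised = True
else:
    strict_raised = False

print(encoded, consumed, replaced, ignored, strict_raised)
```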
+
+StreamWriter and StreamReader define the interface for stateful
+encoders/decoders which work on streams. These allow processing of the
+data in chunks to efficiently use memory. If you have large strings in
+memory, you may want to wrap them with cStringIO objects and then use
+these codecs on them to be able to do chunk processing as well,
+e.g. to provide progress information to the user.
+
+class StreamWriter(Codec):
+
+    def __init__(self,stream,errors='strict'):
+
+        """ Creates a StreamWriter instance.
+
+            stream must be a file-like object open for writing
+            (binary) data.
+
+            The StreamWriter may implement different error handling
+            schemes by providing the errors keyword argument. These
+            parameters are defined:
+
+             'strict' - raise a ValueError (or a subclass)
+             'ignore' - ignore the character and continue with the next
+             'replace'- replace with a suitable replacement character
+
+        """
+        self.stream = stream
+        self.errors = errors
+
+    def write(self,object):
+
+        """ Writes the object's contents encoded to self.stream.
+        """
+        data, consumed = self.encode(object,self.errors)
+        self.stream.write(data)
+        
+    def reset(self):
+
+        """ Flushes and resets the codec buffers used for keeping state.
+
+            Calling this method should ensure that the data on the
+            output is put into a clean state, that allows appending
+            of new fresh data without having to rescan the whole
+            stream to recover state.
+
+        """
+        pass
+
+    def __getattr__(self,name,
+                    getattr=getattr):
+
+        """ Inherit all other methods from the underlying stream.
+        """
+        return getattr(self.stream,name)
+
+class StreamReader(Codec):
+
+    def __init__(self,stream,errors='strict'):
+
+        """ Creates a StreamReader instance.
+
+            stream must be a file-like object open for reading
+            (binary) data.
+
+            The StreamReader may implement different error handling
+            schemes by providing the errors keyword argument. These
+            parameters are defined:
+
+             'strict' - raise a ValueError (or a subclass)
+             'ignore' - ignore the character and continue with the next
+             'replace'- replace with a suitable replacement character
+
+        """
+        self.stream = stream
+        self.errors = errors
+
+    def read(self,size=-1):
+
+        """ Decodes data from the stream self.stream and returns the
+            resulting object.
+
+            size indicates the approximate maximum number of bytes to
+            read from the stream for decoding purposes. The decoder
+            can modify this setting as appropriate. The default value
+            -1 indicates to read and decode as much as possible.  size
+            is intended to prevent having to decode huge files in one
+            step.
+
+            The method should use a greedy read strategy meaning that
+            it should read as much data as is allowed within the
+            definition of the encoding and the given size, e.g.  if
+            optional encoding endings or state markers are available
+            on the stream, these should be read too.
+
+        """
+        # Unsliced reading:
+        if size < 0:
+            return self.decode(self.stream.read(),self.errors)[0]
+
+        # Sliced reading:
+        read = self.stream.read
+        decode = self.decode
+        data = read(size)
+        i = 0
+        while 1:
+            try:
+                object, decodedbytes = decode(data,self.errors)
+            except ValueError:
+                # Decoding can fail if the slice ended in the middle of
+                # a multi-byte sequence: read one more byte and retry.
+                # This method is slow but should work under pretty much
+                # all conditions; at most 10 tries are made
+                i = i + 1
+                newdata = read(1)
+                if not newdata or i > 10:
+                    raise
+                data = data + newdata
+            else:
+                return object
+
+    def reset(self):
+
+        """ Resets the codec buffers used for keeping state.
+
+            Note that no stream repositioning should take place.
+            This method is primarily intended to be able to recover
+            from decoding errors.
+
+        """
+        pass
+
+    def __getattr__(self,name,
+                    getattr=getattr):
+
+        """ Inherit all other methods from the underlying stream.
+        """
+        return getattr(self.stream,name)
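+
+How the stream codecs are meant to be used can be sketched as follows.
+The sketch assumes the codecs module of a present-day Python 3
+interpreter, whose getwriter() and getreader() functions return
+factories with the factory(stream,errors='strict') interface; io.BytesIO
+stands in for the cStringIO objects mentioned above.
+
```python
import codecs
import io

# Writing: Unicode text goes in, encoded bytes arrive on the stream.
raw = io.BytesIO()
writer = codecs.getwriter("utf-8")(raw, errors="strict")
writer.write("abc\u1234\n")
writer.write("second line\n")
encoded = raw.getvalue()

# Reading everything in one step.
reader = codecs.getreader("utf-8")(io.BytesIO(encoded), errors="strict")
whole = reader.read()

# Reading in small chunks; size is an approximate number of bytes.
chunked_reader = codecs.getreader("utf-8")(io.BytesIO(encoded))
chunks = []
while True:
    chunk = chunked_reader.read(4)
    if not chunk:
        break
    chunks.append(chunk)

print(encoded)
print("".join(chunks) == whole)
```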
+
+XXX What about .readline(), .readlines() ? These could be implemented
+    using .read() as generic functions instead of requiring their
+    implementation by all codecs. Also see Line Breaks.
+
+Stream codec implementors are free to combine the StreamWriter and
+StreamReader interfaces into one class. Even combining all these with
+the Codec class should be possible.
+
+Implementors are free to add additional methods to enhance the codec
+functionality or provide extra state information needed for them to
+work. The internal codec implementation will only use the above
+interfaces, though.
+
+It is not required by the Unicode implementation to use these base
+classes, only the interfaces must match; this allows writing Codecs as
+extension types.
+
+As a guideline, large mapping tables should be implemented using static
+C data in separate (shared) extension modules. That way multiple
+processes can share the same data.
+
+A tool to auto-convert Unicode mapping files to mapping modules should be
+provided to simplify support for additional mappings (see References).
+
+
+Whitespace:
+-----------
+
+The .split() method will have to know about what is considered
+whitespace in Unicode.
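+
+For illustration, this is the behaviour in a present-day Python 3
+interpreter (an assumption of this sketch), where .split() without
+arguments already treats Unicode space characters as separators:
+
```python
# U+2003 EM SPACE and U+3000 IDEOGRAPHIC SPACE are Unicode whitespace.
text = "alpha\u2003beta\u3000gamma\tdelta"
words = text.split()
print(words)
```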
+
+
+Case Conversion:
+----------------
+
+Case conversion is rather complicated with Unicode data, since there
+are many different conditions to respect. See
+
+  http://www.unicode.org/unicode/reports/tr13/ 
+
+for some guidelines on implementing case conversion.
+
+For Python, we should only implement the 1-1 conversions included in
+Unicode. Locale dependent and other special case conversions (see the
+Unicode standard file SpecialCasing.txt) should be left to user land
+routines and not go into the core interpreter.
+
+The methods .capitalize() and .iscapitalized() should follow the case
+mapping algorithm defined in the above technical report as closely as
+possible.
+
+
+Line Breaks:
+------------
+
+Line breaking should be done for all Unicode characters having the B
+property as well as the combinations CRLF, CR, LF (interpreted in that
+order) and other special line separators defined by the standard.
+
+The Unicode type should provide a .splitlines() method which returns a
+list of lines according to the above specification. See Unicode
+Methods.
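+
+The intended behaviour can be tried out with the sketch below. It
+assumes a present-day Python 3 interpreter, whose str.splitlines()
+recognizes CRLF, CR, LF and the Unicode line separator U+2028 among
+others; the optional argument there is named keepends rather than
+include_breaks.
+
```python
text = "one\r\ntwo\rthree\nfour\u2028five"

lines = text.splitlines()            # line breaks removed
kept = text.splitlines(True)         # line breaks included

print(lines)
print(kept)
```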
+
+
+Unicode Character Properties:
+-----------------------------
+
+A separate module "unicodedata" should provide a compact interface to
+all Unicode character properties defined in the standard's
+UnicodeData.txt file.
+
+Among other things, these properties provide ways to recognize
+numbers, digits, spaces, whitespace, etc.
+
+Since this module will have to provide access to all Unicode
+characters, it will eventually have to contain the data from
+UnicodeData.txt which takes up around 600kB. For this reason, the data
+should be stored in static C data. This enables compilation as a shared
+module which the underlying OS can share between processes (unlike
+normal Python code modules).
+
+There should be a standard Python interface for accessing this information
+so that other implementors can plug in their own possibly enhanced versions,
+e.g. ones that do decompressing of the data on-the-fly.
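+
+A few property queries are sketched below, assuming the unicodedata
+module as it ships with a present-day Python 3 interpreter; the
+interface available at the time of this proposal may differ.
+
```python
import unicodedata

category = unicodedata.category("A")          # general category
digit_value = unicodedata.digit("\u0663")     # ARABIC-INDIC DIGIT THREE
name = unicodedata.name("\u00e9")
is_space = unicodedata.category("\u2003") == "Zs"   # EM SPACE

print(category, digit_value, name, is_space)
```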
+
+
+Private Code Point Areas:
+-------------------------
+
+Support for these is left to user land Codecs and not explicitly
+integrated into the core. Note that due to the Internal Format being
+implemented, only the area between \uE000 and \uF8FF is usable for
+private encodings.
+
+
+Internal Format:
+----------------
+
+The internal format for Unicode objects should use a Python specific
+fixed format <PythonUnicode> implemented as 'unsigned short' (or
+another unsigned numeric type having 16 bits). Byte order is platform
+dependent.
+
+This format will hold UTF-16 encodings of the corresponding Unicode
+ordinals. The Python Unicode implementation will address these values
+as if they were UCS-2 values. UCS-2 and UTF-16 are the same for all
+currently defined Unicode character points. UTF-16 without surrogates
+provides access to about 64k characters and covers all characters in
+the Basic Multilingual Plane (BMP) of Unicode.
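+
+The relation between UCS-2 and UTF-16 can be made concrete by counting
+16-bit units. The sketch assumes a present-day Python 3 interpreter and
+uses its 'utf-16-be' codec only as a way to produce the UTF-16 form:
+
```python
bmp_char = "\u1234"            # inside the Basic Multilingual Plane
astral_char = "\U00010000"     # first character outside the BMP

# Each UTF-16 code unit occupies two bytes.
bmp_units = len(bmp_char.encode("utf-16-be")) // 2
astral_units = len(astral_char.encode("utf-16-be")) // 2
surrogates = astral_char.encode("utf-16-be")

print(bmp_units, astral_units, surrogates)
```
+
+A BMP character needs one unit, so UCS-2 and UTF-16 agree on it; a
+character outside the BMP needs a surrogate pair of two units.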
+
+It is the Codec's responsibility to ensure that the data they pass to
+the Unicode object constructor respects this assumption. The
+constructor does not check the data for Unicode compliance or use of
+surrogates.
+
+Future implementations can extend the 32 bit restriction to the full
+set of all UTF-16 addressable characters (around 1M characters).
+
+The Unicode API should provide interface routines from <PythonUnicode>
+to the compiler's wchar_t which can be 16 or 32 bit depending on the
+compiler/libc/platform being used.
+
+Unicode objects should have a pointer to a cached Python string object
+<defencstr> holding the object's value using the current <default
+encoding>.  This is needed for performance and internal parsing (see
+Internal Argument Parsing) reasons. The buffer is filled when the
+first conversion request to the <default encoding> is issued on the
+object.
+
+Interning is not needed (for now), since Python identifiers are
+defined as being ASCII only.
+
+codecs.BOM should return the byte order mark (BOM) for the format
+used internally. The codecs module should provide the following
+additional constants for convenience and reference (codecs.BOM will
+either be BOM_BE or BOM_LE depending on the platform):
+
+  BOM_BE: '\376\377' 
+    (corresponds to Unicode U+0000FEFF in UTF-16 on big endian
+     platforms == ZERO WIDTH NO-BREAK SPACE)
+
+  BOM_LE: '\377\376' 
+    (corresponds to Unicode U+0000FEFF in UTF-16 on little endian
+     platforms; read in big endian order these bytes give U+0000FFFE,
+     which is defined as being an illegal Unicode character)
+
+  BOM4_BE: '\000\000\376\377'
+    (corresponds to Unicode U+0000FEFF in UCS-4)
+
+  BOM4_LE: '\377\376\000\000'
+    (corresponds to Unicode U+0000FEFF in UCS-4 with the byte order
+     swapped)
+
+Note that Unicode sees big endian byte order as being "correct". The
+swapped order is taken to be an indicator for a "wrong" format, hence
+the illegal character definition.
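+
+The byte order marks can be inspected with the sketch below. It
+assumes the codecs module of a present-day Python 3 interpreter, which
+offers BOM, BOM_BE and BOM_LE as bytes objects; the four-byte marks are
+spelled BOM_UTF32_BE and BOM_UTF32_LE there rather than BOM4_BE and
+BOM4_LE.
+
```python
import codecs
import sys

# codecs.BOM is the mark for the platform's native byte order.
expected = codecs.BOM_LE if sys.byteorder == "little" else codecs.BOM_BE
native_matches = codecs.BOM == expected

# The octal escapes used above denote the same bytes.
be_matches = codecs.BOM_BE == b"\376\377"
le_matches = codecs.BOM_LE == b"\377\376"

# Decoded with the matching byte order, either mark yields U+FEFF.
from_le = codecs.BOM_LE.decode("utf-16-le")
from_be = codecs.BOM_BE.decode("utf-16-be")

print(native_matches, be_matches, le_matches, from_le == from_be)
```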
+
+The configure script should provide aid in deciding whether Python can
+use the native wchar_t type or not (it has to be a 16-bit unsigned
+type).
+
+
+Buffer Interface:
+-----------------
+
+Implement the buffer interface using the <defencstr> Python string
+object as basis for bf_getcharbuf (corresponds to the "t#" argument
+parsing marker) and the internal buffer for bf_getreadbuf (corresponds
+to the "s#" argument parsing marker). If bf_getcharbuf is requested
+and the <defencstr> object does not yet exist, it is created first.
+
+This has the advantage of being able to write to output streams (which
+typically use this interface) without additional specification of the
+encoding to use.
+
+The internal format can also be accessed using the 'unicode-internal'
+codec, e.g. via u.encode('unicode-internal').
+
+
+Pickle/Marshalling:
+-------------------
+
+Should have native Unicode object support. The objects should be
+encoded using platform independent encodings.
+
+Marshal should use UTF-8 and Pickle should either choose
+Raw-Unicode-Escape (in text mode) or UTF-8 (in binary mode) as
+encoding. Using UTF-8 instead of UTF-16 has the advantage of
+eliminating the need to store a BOM mark.
+
+
+Regular Expressions:
+--------------------
+
+Secret Labs AB is working on a Unicode-aware regular expression
+machinery.  It works on plain 8-bit, UCS-2, and (optionally) UCS-4
+internal character buffers.
+
+Also see
+
+        http://www.unicode.org/unicode/reports/tr18/
+
+for some remarks on how to treat Unicode REs.
+
+
+Formatting Markers:
+-------------------
+
+Format markers are used in Python format strings. If Python strings
+are used as format strings, the following interpretations should be in
+effect:
+
+  '%s':                 '%s' does str(u) for Unicode objects embedded
+                        in Python strings, so the output will be
+                        u.encode(<default encoding>)
+
+In case the format string is a Unicode object, all parameters are coerced
+to Unicode first and then put together and formatted according to the format
+string. Numbers are first converted to strings and then to Unicode.
+
+  '%s':                 Python strings are interpreted as Unicode
+                        strings using the <default encoding>. Unicode
+                        objects are taken as is.
+
+All other string formatters should work accordingly.
+
+Example:
+
+u"%s %s" % (u"abc", "abc")  ==  u"abc abc"
+
+
+Internal Argument Parsing:
+--------------------------
+
+These markers are used by the PyArg_ParseTuple() APIs:
+
+  'U':  Check for Unicode object and return a pointer to it
+
+  's':  For Unicode objects: auto convert them to the <default encoding>
+        and return a pointer to the object's <defencstr> buffer.
+
+  's#': Access to the Unicode object via the bf_getreadbuf buffer interface 
+        (see Buffer Interface); note that the length relates to the buffer
+        length, not the Unicode string length (this may be different
+        depending on the Internal Format).
+
+  't#': Access to the Unicode object via the bf_getcharbuf buffer interface
+        (see Buffer Interface); note that the length relates to the buffer
+        length, not necessarily to the Unicode string length (this may
+        be different depending on the <default encoding>).
+
+
+File/Stream Output:
+-------------------
+
+Since file.write(object) and most other stream writers use the "s#"
+argument parsing marker for binary files and "t#" for text files, the
+buffer interface implementation determines the encoding to use (see
+Buffer Interface).
+
+For explicit handling of files using Unicode, the standard
+stream codecs as available through the codecs module should 
+be used.
+
+XXX There should be a short-cut open(filename,mode,encoding) available which
+    also assures that mode contains the 'b' character when needed.
+
+
+File/Stream Input:
+------------------
+
+Only the user knows what encoding the input data uses, so no special
+magic is applied. The user will have to explicitly convert the string
+data to Unicode objects as needed or use the file wrappers defined in
+the codecs module (see File/Stream Output).
+
+
+Unicode Methods & Attributes:
+-----------------------------
+
+All Python string methods, plus:
+
+  .encode([encoding=<default encoding>][,errors="strict"]) 
+     --> see Unicode Output
+
+  .splitlines([include_breaks=0])
+     --> breaks the Unicode string into a list of (Unicode) lines;
+         returns the lines with line breaks included, if include_breaks
+         is true. See Line Breaks for a specification of how line breaking
+         is done.
+
+
+Code Base:
+----------
+
+We should use Fredrik Lundh's Unicode object implementation as a basis.
+It already implements most of the string methods needed and provides a
+well written code base which we can build upon.
+
+The object sharing implemented in Fredrik's implementation should
+be dropped.
+
+
+Test Cases:
+-----------
+
+Test cases should follow those in Lib/test/test_string.py and include
+additional checks for the Codec Registry and the Standard Codecs.
+
+
+References:
+-----------
+
+Unicode Consortium:
+        http://www.unicode.org/
+
+Unicode FAQ:
+        http://www.unicode.org/unicode/faq/
+
+Unicode 3.0:
+        http://www.unicode.org/unicode/standard/versions/Unicode3.0.html
+
+Unicode-TechReports:
+        http://www.unicode.org/unicode/reports/techreports.html
+
+Unicode-Mappings:
+        ftp://ftp.unicode.org/Public/MAPPINGS/
+
+Introduction to Unicode (a little outdated but still nice to read):
+        http://www.nada.kth.se/i18n/ucs/unicode-iso10646-oview.html
+
+Encodings:
+
+    Overview:
+            http://czyborra.com/utf/
+
+    UCS-2:
+            http://www.uazone.com/multiling/unicode/ucs2.html
+
+    UTF-7:
+            Defined in RFC2152, e.g.
+            http://www.uazone.com/multiling/ml-docs/rfc2152.txt
+
+    UTF-8:
+            Defined in RFC2279, e.g.
+            http://info.internet.isi.edu/in-notes/rfc/files/rfc2279.txt
+
+    UTF-16:
+            http://www.uazone.com/multiling/unicode/wg2n1035.html
+
+
+History of this Proposal:
+-------------------------
+1.2: 
+1.1: Added note about comparisons and hash values. Added note about
+     case mapping algorithms. Changed stream codecs .read() and
+     .write() method to match the standard file-like object methods
+     (bytes consumed information is no longer returned by the methods)
+1.0: changed encode Codec method to be symmetric to the decode method
+     (they both return (object, data consumed) now and thus become
+     interchangeable); removed __init__ method of Codec class (the
+     methods are stateless) and moved the errors argument down to the
+     methods; made the Codec design more generic w/r to type of input
+     and output objects; changed StreamWriter.flush to StreamWriter.reset
+     in order to avoid overriding the stream's .flush() method;
+     renamed .breaklines() to .splitlines(); renamed the module unicodec
+     to codecs; modified the File I/O section to refer to the stream codecs.
+0.9: changed errors keyword argument definition; added 'replace' error
+     handling; changed the codec APIs to accept buffer like objects on
+     input; some minor typo fixes; added Whitespace section and
+     included references for Unicode characters that have the whitespace
+     and the line break characteristic; added note that search functions
+     can expect lower-case encoding names; dropped slicing and offsets
+     in the codec APIs
+0.8: added encodings package and raw unicode escape encoding; untabified
+     the proposal; added notes on Unicode format strings; added
+     .breaklines() method
+0.7: added a whole new set of codec APIs; added a different encoder
+     lookup scheme; fixed some names
+0.6: changed "s#" to "t#"; changed <defencbuf> to <defencstr> holding
+     a real Python string object; changed Buffer Interface to delegate
+     requests to <defencstr>'s buffer interface; removed the explicit
+     reference to the unicodec.codecs dictionary (the module can implement
+     this in a way fit for the purpose); removed the settable default
+     encoding; moved UnicodeError from unicodec to exceptions; "s#"
+     now returns the internal data; passed the UCS-2/UTF-16 checking
+     from the Unicode constructor to the Codecs
+0.5: moved sys.bom to unicodec.BOM; added sections on case mapping,
+     private use encodings and Unicode character properties
+0.4: added Codec interface, notes on %-formatting, changed some encoding
+     details, added comments on stream wrappers, fixed some discussion
+     points (most important: Internal Format), clarified the 
+     'unicode-escape' encoding, added encoding references
+0.3: added references, comments on codec modules, the internal format,
+     bf_getcharbuffer and the RE engine; added 'unicode-escape' encoding
+     proposed by Tim Peters and fixed repr(u) accordingly
+0.2: integrated Guido's suggestions, added stream codecs and file
+     wrapping
+0.1: first version
+
+
+-----------------------------------------------------------------------------
+Written by Marc-Andre Lemburg, 1999-2000, mal@lemburg.com
+-----------------------------------------------------------------------------