author    | Guido van Rossum <guido@python.org> | 2000-03-10 23:14:11 (GMT)
committer | Guido van Rossum <guido@python.org> | 2000-03-10 23:14:11 (GMT)
commit    | 9ed0d1ef18321f8939cd899276bba27cb61e5c3a (patch)
tree      | 1d26cde56f6ff67d6c126d7628e08712dcd9d8c6 /Misc
parent    | e141fd84e96abf8eb509e7c4d5503fb5cd972758 (diff)
Marc-Andre Lemburg: Python Unicode integration proposal, version 1.2.
Diffstat (limited to 'Misc')
-rw-r--r-- | Misc/unicode.txt | 885
1 files changed, 885 insertions, 0 deletions
diff --git a/Misc/unicode.txt b/Misc/unicode.txt
new file mode 100644
index 0000000..b31beef
--- /dev/null
+++ b/Misc/unicode.txt
@@ -0,0 +1,885 @@

=============================================================================
 Python Unicode Integration                          Proposal Version: 1.2
-----------------------------------------------------------------------------


Introduction:
-------------

The idea of this proposal is to add native Unicode 3.0 support to
Python in a way that makes the use of Unicode strings as simple as
possible without introducing too many pitfalls along the way.

Since this goal is not easy to achieve -- strings being one of the
most fundamental objects in Python -- we expect this proposal to
undergo some significant refinements.

Note that the current version of this proposal is still a bit unsorted
due to the many different aspects of the Unicode-Python integration.

The latest version of this document is always available at:

        http://starship.skyport.net/~lemburg/unicode-proposal.txt

Older versions are available as:

        http://starship.skyport.net/~lemburg/unicode-proposal-X.X.txt


Conventions:
------------

· In examples, u stands for a Unicode object and s for a Python string.

· 'XXX' markings indicate points of discussion (PODs).


General Remarks:
----------------

· Unicode encoding names should be lower case on output and
  case-insensitive on input (they will be converted to lower case
  by all APIs taking an encoding name as input).

  Encoding names should follow the name conventions as used by the
  Unicode Consortium: spaces are converted to hyphens, e.g. 'utf 16'
  is written as 'utf-16'.

  Codec modules should use the same names, but with hyphens converted
  to underscores, e.g. utf_8, utf_16, iso_8859_1.

· The <default encoding> should be the widely used 'utf-8' format.
  This is very close to the standard 7-bit ASCII format and thus
  resembles the standard used in programming today in most respects.
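
The name normalization just described is mechanical enough to sketch in
a few lines. Below is a minimal sketch, assuming the hypothetical
helpers normalize_encoding() and codec_module_name() -- neither name is
part of this proposal:

import string

def normalize_encoding(name):
    # Hypothetical helper: lower-case the name and convert spaces to
    # hyphens, e.g. 'UTF 16' -> 'utf-16'.
    name = string.lower(string.strip(name))
    return string.replace(name, ' ', '-')

def codec_module_name(name):
    # Hypothetical helper: derive the codec module name from the
    # canonical encoding name, e.g. 'utf-16' -> 'utf_16'.
    return string.replace(normalize_encoding(name), '-', '_')

For example, codec_module_name('ISO 8859-1') yields 'iso_8859_1'.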


Unicode Constructors:
---------------------

Python should provide a built-in constructor for Unicode strings which
is available through __builtins__:

  u = unicode(encoded_string[,encoding=<default encoding>][,errors="strict"])

  u = u'<unicode-escape encoded Python string>'

  u = ur'<raw-unicode-escape encoded Python string>'

With the 'unicode-escape' encoding being defined as:

· all non-escape characters represent themselves as Unicode ordinals
  (e.g. 'a' -> U+0061).

· all existing defined Python escape sequences are interpreted as
  Unicode ordinals; note that \xXXXX can represent all Unicode
  ordinals, and \OOO (octal) can represent Unicode ordinals up to
  U+01FF.

· a new escape sequence, \uXXXX, represents U+XXXX; it is a syntax
  error to have fewer than 4 digits after \u.

For an explanation of possible values for errors, see the Codec
section below.

Examples:

u'abc'          -> U+0061 U+0062 U+0063
u'\u1234'       -> U+1234
u'abc\u1234\n'  -> U+0061 U+0062 U+0063 U+1234 U+000A

The 'raw-unicode-escape' encoding is defined as follows:

· a \uXXXX sequence represents the U+XXXX Unicode character if and
  only if the number of leading backslashes is odd

· all other characters represent themselves as Unicode ordinals
  (e.g. 'b' -> U+0062)

Note that you should provide a hint about the encoding used to write
your programs as a pragma line in one of the first few comment lines
of the source file (e.g. '# source file encoding: latin-1'). If you
only use 7-bit ASCII then everything is fine and no such notice is
needed, but if you include Latin-1 characters not defined in ASCII, it
may well be worthwhile including a hint since people in other
countries will want to be able to read your source strings too.


Unicode Type Object:
--------------------

Unicode objects should have the type UnicodeType with the type name
'unicode', made available through the standard types module.


Unicode Output:
---------------

Unicode objects have a method .encode([encoding=<default encoding>])
which returns a Python string encoding the Unicode string using the
given scheme (see Codecs).

  print u := print u.encode()   # using the <default encoding>

  str(u)  := u.encode()         # using the <default encoding>

  repr(u) := "u%s" % repr(u.encode('unicode-escape'))

Also see Internal Argument Parsing and Buffer Interface for details on
how other APIs written in C will treat Unicode objects.


Unicode Ordinals:
-----------------

Since Unicode 3.0 has a 32-bit ordinal character set, the
implementation should provide 32-bit aware ordinal conversion APIs:

  ord(u[:1]) (this is the standard ord() extended to work with Unicode
              objects)
        --> Unicode ordinal number (32-bit)

  unichr(i)
        --> Unicode object for character i (provided it is 32-bit);
            ValueError otherwise

Both APIs should go into __builtins__ just like their string
counterparts ord() and chr().

Note that Unicode provides space for private encodings. Usage of these
can cause different output representations on different machines. This
problem is not a Python or Unicode problem, but a machine setup and
maintenance one.


Comparison & Hash Value:
------------------------

Unicode objects should compare equal to other objects after these
other objects have been coerced to Unicode. For strings this means
that they are interpreted as Unicode strings using the <default
encoding>.

For the same reason, Unicode objects should return the same hash value
as their UTF-8 equivalent strings.
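
A short sketch of the intended semantics -- none of this behavior
exists yet; the snippet simply restates the two rules above as code:

    u = u'abc'
    s = 'abc'
    assert u == s              # s is coerced using the <default encoding>
    assert hash(u) == hash(s)  # equal objects must hash equal...
    d = {u: 1}
    print d[s]                 # ...so mixed dictionary lookups work: prints 1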


Coercion:
---------

Using Python strings and Unicode objects to form new objects should
always coerce to the more precise format, i.e. Unicode objects.

  u + s := u + unicode(s)

  s + u := unicode(s) + u

All string methods should delegate the call to an equivalent Unicode
object method call by converting all involved strings to Unicode and
then applying the arguments to the Unicode method of the same name,
e.g.

  string.join((s,u),sep) := (s + sep) + u

  sep.join((s,u)) := (s + sep) + u

For a discussion of %-formatting with respect to Unicode objects, see
Formatting Markers.


Exceptions:
-----------

UnicodeError is defined in the exceptions module as a subclass of
ValueError. It is available at the C level via PyExc_UnicodeError.
All exceptions related to Unicode encoding/decoding should be
subclasses of UnicodeError.


Codecs (Coder/Decoders) Lookup:
-------------------------------

A Codec (see Codec Interface Definition) search registry should be
implemented by a module "codecs":

  codecs.register(search_function)

Search functions are expected to take one argument, the encoding name
in all lower case letters, and return a tuple of functions (encoder,
decoder, stream_reader, stream_writer) taking the following arguments:

  encoder and decoder:
        These must be functions or methods which have the same
        interface as the .encode/.decode methods of Codec instances
        (see Codec Interface). The functions/methods are expected to
        work in a stateless mode.

  stream_reader and stream_writer:
        These need to be factory functions with the following
        interface:

                factory(stream,errors='strict')

        The factory functions must return objects providing the
        interfaces defined by StreamWriter/StreamReader, resp. (see
        Codec Interface). Stream codecs can maintain state.

        Possible values for errors are defined in the Codec section
        below.

In case a search function cannot find a given encoding, it should
return None.

Aliasing support for encodings is left to the search functions to
implement.

The codecs module will maintain an encoding cache for performance
reasons. Encodings are first looked up in the cache. If not found, the
list of registered search functions is scanned. If no codecs tuple is
found, a LookupError is raised. Otherwise, the codecs tuple is stored
in the cache and returned to the caller.

To query the Codec instance the following API should be used:

  codecs.lookup(encoding)

This will either return the found codecs tuple or raise a LookupError.
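
To make the registry protocol concrete, here is a minimal sketch of a
search function for a single codec. The encoding name 'myascii' and
the classes are made up for this example; only codecs.register(), the
lower-case name argument and the four-tuple shape come from the
proposal:

import codecs

class MyCodec(codecs.Codec):

    """ Hypothetical stateless codec: a trivial pass-through. """

    def encode(self,input,errors='strict'):
        return str(input), len(input)

    def decode(self,input,errors='strict'):
        return unicode(input), len(input)

class MyStreamReader(MyCodec,codecs.StreamReader):
    pass

class MyStreamWriter(MyCodec,codecs.StreamWriter):
    pass

def myascii_search(encoding):
    # Search functions receive lower-case names and must return
    # None for encodings they do not know about.
    if encoding != 'myascii':
        return None
    c = MyCodec()
    return (c.encode, c.decode, MyStreamReader, MyStreamWriter)

codecs.register(myascii_search)

Note that the stream classes themselves serve as the factory
functions: calling MyStreamWriter(stream,errors='strict') matches the
factory interface defined above.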


Standard Codecs:
----------------

Standard codecs should live inside an encodings/ package directory in
the Standard Python Code Library. The __init__.py file of that
directory should include a Codec Lookup compatible search function
implementing a lazy module based codec lookup.

Python should provide a few standard codecs for the most relevant
encodings, e.g.

  'utf-8':              8-bit variable length encoding
  'utf-16':             16-bit variable length encoding (little/big endian)
  'utf-16-le':          utf-16 but explicitly little endian
  'utf-16-be':          utf-16 but explicitly big endian
  'ascii':              7-bit ASCII codepage
  'iso-8859-1':         ISO 8859-1 (Latin 1) codepage
  'unicode-escape':     See Unicode Constructors for a definition
  'raw-unicode-escape': See Unicode Constructors for a definition
  'native':             Dump of the Internal Format used by Python

Common aliases should also be provided per default, e.g. 'latin-1'
for 'iso-8859-1'.

Note: 'utf-16' should be implemented by using and requiring byte order
marks (BOM) for file input/output.

All other encodings, such as the CJK ones needed to support Asian
scripts, should be implemented in separate packages which do not get
included in the core Python distribution and are not part of this
proposal.
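
The lazy module-based lookup mentioned above could take roughly the
following shape. The _cache dictionary and the per-module
getregentry() hook are assumptions of this sketch, not requirements of
the proposal:

# encodings/__init__.py -- sketch of a lazy module based search function
import string

_cache = {}

def search_function(encoding):
    # 'utf-8' -> 'utf_8' (see General Remarks for the name conventions)
    modname = string.replace(encoding, '-', '_')
    if _cache.has_key(modname):
        return _cache[modname]
    try:
        mod = __import__(modname, globals(), locals(), ['*'])
    except ImportError:
        # Unknown encoding; other registered search functions may
        # still be able to provide it.
        return None
    # Hypothetical hook: the codec module returns its codecs tuple
    # (encoder, decoder, stream_reader, stream_writer):
    entry = mod.getregentry()
    _cache[modname] = entry
    return entry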


Codecs Interface Definition:
----------------------------

The following base classes should be defined in the module
"codecs". They provide not only templates for use by encoding module
implementors, but also define the interface which is expected by the
Unicode implementation.

Note that the Codec Interface defined here is well suited to a larger
range of applications. The Unicode implementation expects Unicode
objects on input for .encode() and .write() and character buffer
compatible objects on input for .decode(). Output of .encode() and
.read() should be a Python string and .decode() must return a Unicode
object.

First, we have the stateless encoders/decoders. These do not work in
chunks as the stream codecs (see below) do, because all components are
expected to be available in memory.

class Codec:

    """ Defines the interface for stateless encoders/decoders.

        The .encode()/.decode() methods may implement different error
        handling schemes by providing the errors argument. These
        string values are defined:

         'strict'  - raise an error (or a subclass)
         'ignore'  - ignore the character and continue with the next
         'replace' - replace with a suitable replacement character;
                     Python will use the official U+FFFD REPLACEMENT
                     CHARACTER for the builtin Unicode codecs.

    """
    def encode(self,input,errors='strict'):

        """ Encodes the object input and returns a tuple (output
            object, length consumed).

            errors defines the error handling to apply. It defaults to
            'strict' handling.

            The method may not store state in the Codec instance. Use
            StreamCodec for codecs which have to keep state in order
            to make encoding/decoding efficient.

        """
        ...

    def decode(self,input,errors='strict'):

        """ Decodes the object input and returns a tuple (output
            object, length consumed).

            input must be an object which provides the bf_getreadbuf
            buffer slot. Python strings, buffer objects and memory
            mapped files are examples of objects providing this slot.

            errors defines the error handling to apply. It defaults to
            'strict' handling.

            The method may not store state in the Codec instance. Use
            StreamCodec for codecs which have to keep state in order
            to make encoding/decoding efficient.

        """
        ...

StreamWriter and StreamReader define the interface for stateful
encoders/decoders which work on streams. These allow processing of the
data in chunks to use memory efficiently. If you have large strings in
memory, you may want to wrap them with cStringIO objects and then use
these codecs on them to be able to do chunk processing as well,
e.g. to provide progress information to the user.

class StreamWriter(Codec):

    def __init__(self,stream,errors='strict'):

        """ Creates a StreamWriter instance.

            stream must be a file-like object open for writing
            (binary) data.

            The StreamWriter may implement different error handling
            schemes by providing the errors keyword argument. These
            parameters are defined:

             'strict'  - raise a ValueError (or a subclass)
             'ignore'  - ignore the character and continue with the next
             'replace' - replace with a suitable replacement character

        """
        self.stream = stream
        self.errors = errors

    def write(self,object):

        """ Writes the object's contents encoded to self.stream.
        """
        data, consumed = self.encode(object,self.errors)
        self.stream.write(data)

    def reset(self):

        """ Flushes and resets the codec buffers used for keeping state.

            Calling this method should ensure that the data on the
            output is put into a clean state that allows appending
            of new fresh data without having to rescan the whole
            stream to recover state.

        """
        pass

    def __getattr__(self,name,

                    getattr=getattr):

        """ Inherit all other methods from the underlying stream.
        """
        return getattr(self.stream,name)

class StreamReader(Codec):

    def __init__(self,stream,errors='strict'):

        """ Creates a StreamReader instance.

            stream must be a file-like object open for reading
            (binary) data.

            The StreamReader may implement different error handling
            schemes by providing the errors keyword argument. These
            parameters are defined:

             'strict'  - raise a ValueError (or a subclass)
             'ignore'  - ignore the character and continue with the next
             'replace' - replace with a suitable replacement character

        """
        self.stream = stream
        self.errors = errors

    def read(self,size=-1):

        """ Decodes data from the stream self.stream and returns the
            resulting object.

            size indicates the approximate maximum number of bytes to
            read from the stream for decoding purposes. The decoder
            can modify this setting as appropriate. The default value
            -1 indicates to read and decode as much as possible. size
            is intended to prevent having to decode huge files in one
            step.

            The method should use a greedy read strategy, meaning that
            it should read as much data as is allowed within the
            definition of the encoding and the given size, e.g. if
            optional encoding endings or state markers are available
            on the stream, these should be read too.

        """
        # Unsliced reading:
        if size < 0:
            return self.decode(self.stream.read())[0]

        # Sliced reading:
        read = self.stream.read
        decode = self.decode
        data = read(size)
        i = 0
        while 1:
            try:
                object, decodedbytes = decode(data)
            except ValueError, why:
                # This method is slow but should work under pretty
                # much all conditions; at most 10 tries are made
                i = i + 1
                newdata = read(1)
                if not newdata or i > 10:
                    raise
                data = data + newdata
            else:
                return object

    def reset(self):

        """ Resets the codec buffers used for keeping state.

            Note that no stream repositioning should take place.
            This method is primarily intended to be able to recover
            from decoding errors.

        """
        pass

    def __getattr__(self,name,

                    getattr=getattr):

        """ Inherit all other methods from the underlying stream.
        """
        return getattr(self.stream,name)

XXX What about .readline(), .readlines()? These could be implemented
    using .read() as generic functions instead of requiring their
    implementation by all codecs. Also see Line Breaks.

Stream codec implementors are free to combine the StreamWriter and
StreamReader interfaces into one class. Even combining all these with
the Codec class should be possible.

Implementors are free to add additional methods to enhance the codec
functionality or provide extra state information needed for them to
work. The internal codec implementation will only use the above
interfaces, though.

It is not required by the Unicode implementation to use these base
classes, only the interfaces must match; this allows writing Codecs as
extension types.

As a guideline, large mapping tables should be implemented using
static C data in separate (shared) extension modules. That way
multiple processes can share the same data.

A tool to auto-convert Unicode mapping files to mapping modules should
be provided to simplify support for additional mappings (see
References).
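
A usage sketch tying the lookup registry and the stream interfaces
together; the file name and encoding are arbitrary and error handling
is omitted for brevity:

    import codecs

    encode, decode, Reader, Writer = codecs.lookup('utf-8')

    # Encode Unicode data onto a binary stream:
    f = open('sample.txt', 'wb')
    writer = Writer(f)
    writer.write(u'abc\u1234')
    writer.reset()             # put any codec state into a clean state
    f.close()

    # Decode it back; .read() uses a greedy strategy and decodes
    # as much as possible:
    f = open('sample.txt', 'rb')
    reader = Reader(f)
    u = reader.read()
    f.close()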


Whitespace:
-----------

The .split() method will have to know about what is considered
whitespace in Unicode.


Case Conversion:
----------------

Case conversion is rather complicated with Unicode data, since there
are many different conditions to respect. See

  http://www.unicode.org/unicode/reports/tr13/

for some guidelines on implementing case conversion.

For Python, we should only implement the 1-1 conversions included in
Unicode. Locale dependent and other special case conversions (see the
Unicode standard file SpecialCasing.txt) should be left to user land
routines and not go into the core interpreter.

The methods .capitalize() and .iscapitalized() should follow the case
mapping algorithm defined in the above technical report as closely as
possible.


Line Breaks:
------------

Line breaking should be done for all Unicode characters having the B
property as well as the combinations CRLF, CR, LF (interpreted in that
order) and other special line separators defined by the standard.

The Unicode type should provide a .splitlines() method which returns a
list of lines according to the above specification. See Unicode
Methods.


Unicode Character Properties:
-----------------------------

A separate module "unicodedata" should provide a compact interface to
all Unicode character properties defined in the standard's
UnicodeData.txt file.

Among other things, these properties provide ways to recognize
numbers, digits, spaces, whitespace, etc.

Since this module will have to provide access to all Unicode
characters, it will eventually have to contain the data from
UnicodeData.txt which takes up around 600kB. For this reason, the data
should be stored in static C data. This enables compilation as a
shared module which the underlying OS can share between processes
(unlike normal Python code modules).

There should be a standard Python interface for accessing this
information so that other implementors can plug in their own possibly
enhanced versions, e.g. ones that do decompressing of the data
on-the-fly.
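
A hedged sketch of what using such an interface could look like; the
function names digit(), numeric() and category() are illustrative
assumptions, not names fixed by this proposal:

    import unicodedata

    # Recognize numbers and digits without hardcoding ordinal ranges:
    print unicodedata.digit(u'3')     # e.g. 3
    print unicodedata.numeric(u'3')   # e.g. 3.0
    print unicodedata.category(u' ')  # e.g. 'Zs' (separator, space)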


Private Code Point Areas:
-------------------------

Support for these is left to user land Codecs and not explicitly
integrated into the core. Note that, due to the Internal Format being
implemented, only the area between \uE000 and \uF8FF is usable for
private encodings.


Internal Format:
----------------

The internal format for Unicode objects should use a Python specific
fixed format <PythonUnicode> implemented as 'unsigned short' (or
another unsigned numeric type having 16 bits). Byte order is platform
dependent.

This format will hold UTF-16 encodings of the corresponding Unicode
ordinals. The Python Unicode implementation will address these values
as if they were UCS-2 values. UCS-2 and UTF-16 are the same for all
currently defined Unicode character points. UTF-16 without surrogates
provides access to about 64k characters and covers all characters in
the Basic Multilingual Plane (BMP) of Unicode.

It is the Codec's responsibility to ensure that the data it passes to
the Unicode object constructor respects this assumption. The
constructor does not check the data for Unicode compliance or use of
surrogates.

Future implementations can extend the 16-bit restriction to the full
set of all UTF-16 addressable characters (around 1M characters).

The Unicode API should provide interface routines from <PythonUnicode>
to the compiler's wchar_t, which can be 16 or 32 bits depending on the
compiler/libc/platform being used.

Unicode objects should have a pointer to a cached Python string object
<defencstr> holding the object's value using the current <default
encoding>. This is needed for performance and internal parsing (see
Internal Argument Parsing) reasons. The buffer is filled when the
first conversion request to the <default encoding> is issued on the
object.

Interning is not needed (for now), since Python identifiers are
defined as being ASCII only.

codecs.BOM should return the byte order mark (BOM) for the format used
internally. The codecs module should provide the following additional
constants for convenience and reference (codecs.BOM will either be
BOM_BE or BOM_LE depending on the platform):

  BOM_BE: '\376\377'
     (corresponds to Unicode U+0000FEFF in UTF-16 on big endian
      platforms == ZERO WIDTH NO-BREAK SPACE)

  BOM_LE: '\377\376'
     (corresponds to Unicode U+0000FFFE in UTF-16 on little endian
      platforms == defined as being an illegal Unicode character)

  BOM4_BE: '\000\000\376\377'
     (corresponds to Unicode U+0000FEFF in UCS-4)

  BOM4_LE: '\377\376\000\000'
     (corresponds to Unicode U+0000FFFE in UCS-4)

Note that Unicode sees big endian byte order as being "correct". The
swapped order is taken to be an indicator for a "wrong" format, hence
the illegal character definition.

The configure script should provide aid in deciding whether Python can
use the native wchar_t type or not (it has to be a 16-bit unsigned
type).
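
A small sketch of how the BOM constants above could be used to pick
the right decoder when reading UTF-16 data; the helper and its
fallback policy are assumptions of this example:

import codecs

def guess_utf16_encoding(data):
    # Hypothetical helper: inspect the first two bytes of data and
    # return the encoding name to use for decoding.
    if data[:2] == codecs.BOM_BE:
        return 'utf-16-be'
    elif data[:2] == codecs.BOM_LE:
        return 'utf-16-le'
    else:
        # No BOM found: fall back to 'utf-16', which per the Standard
        # Codecs note requires a BOM and will raise an error.
        return 'utf-16'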


Buffer Interface:
-----------------

Implement the buffer interface using the <defencstr> Python string
object as basis for bf_getcharbuf (corresponds to the "t#" argument
parsing marker) and the internal buffer for bf_getreadbuf (corresponds
to the "s#" argument parsing marker). If bf_getcharbuf is requested
and the <defencstr> object does not yet exist, it is created first.

This has the advantage of being able to write to output streams (which
typically use this interface) without additional specification of the
encoding to use.

The internal format can also be accessed using the 'unicode-internal'
codec, e.g. via u.encode('unicode-internal').


Pickle/Marshalling:
-------------------

Should have native Unicode object support. The objects should be
encoded using platform independent encodings.

Marshal should use UTF-8 and Pickle should either choose
Raw-Unicode-Escape (in text mode) or UTF-8 (in binary mode) as
encoding. Using UTF-8 instead of UTF-16 has the advantage of
eliminating the need to store a BOM mark.


Regular Expressions:
--------------------

Secret Labs AB is working on Unicode-aware regular expression
machinery. It works on plain 8-bit, UCS-2, and (optionally) UCS-4
internal character buffers.

Also see

  http://www.unicode.org/unicode/reports/tr18/

for some remarks on how to treat Unicode REs.


Formatting Markers:
-------------------

Format markers are used in Python format strings. If Python strings
are used as format strings, the following interpretations should be in
effect:

  '%s': '%s' does str(u) for Unicode objects embedded
        in Python strings, so the output will be
        u.encode(<default encoding>)

In case the format string is a Unicode object, all parameters are
coerced to Unicode first and then put together and formatted according
to the format string. Numbers are first converted to strings and then
to Unicode.

  '%s': Python strings are interpreted as Unicode
        strings using the <default encoding>. Unicode
        objects are taken as is.

All other string formatters should work accordingly.

Example:

u"%s %s" % (u"abc", "abc")  ==  u"abc abc"


Internal Argument Parsing:
--------------------------

These markers are used by the PyArg_ParseTuple() APIs:

  'U':  Check for Unicode object and return a pointer to it

  's':  For Unicode objects: auto convert them to the <default
        encoding> and return a pointer to the object's <defencstr>
        buffer.

  's#': Access to the Unicode object via the bf_getreadbuf buffer
        interface (see Buffer Interface); note that the length relates
        to the buffer length, not the Unicode string length (this may
        be different depending on the Internal Format).

  't#': Access to the Unicode object via the bf_getcharbuf buffer
        interface (see Buffer Interface); note that the length relates
        to the buffer length, not necessarily to the Unicode string
        length (this may be different depending on the <default
        encoding>).


File/Stream Output:
-------------------

Since file.write(object) and most other stream writers use the "s#"
argument parsing marker for binary files and "t#" for text files, the
buffer interface implementation determines the encoding to use (see
Buffer Interface).

For explicit handling of files using Unicode, the standard stream
codecs as available through the codecs module should be used.

XXX There should be a short-cut open(filename,mode,encoding) available
    which also assures that mode contains the 'b' character when
    needed.
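
The short-cut mentioned in the XXX above is easy to sketch on top of
the stream factories; the name codecs_open() and its exact behavior
are assumptions of this example, not part of the proposal:

import codecs

def codecs_open(filename, mode, encoding, errors='strict'):
    # Hypothetical shortcut: open filename and wrap the file object
    # with the stream codec for encoding. Binary mode is forced,
    # since the codec, not the platform, defines the byte layout.
    if 'b' not in mode:
        mode = mode + 'b'
    f = open(filename, mode)
    encode, decode, Reader, Writer = codecs.lookup(encoding)
    if 'r' in mode:
        return Reader(f, errors)
    else:
        return Writer(f, errors)

The returned object inherits all other file methods via __getattr__,
so e.g. .close() works as expected.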


File/Stream Input:
------------------

Only the user knows what encoding the input data uses, so no special
magic is applied. The user will have to explicitly convert the string
data to Unicode objects as needed or use the file wrappers defined in
the codecs module (see File/Stream Output).


Unicode Methods & Attributes:
-----------------------------

All Python string methods, plus:

  .encode([encoding=<default encoding>][,errors="strict"])
     --> see Unicode Output

  .splitlines([include_breaks=0])
     --> breaks the Unicode string into a list of (Unicode) lines;
         returns the lines with line breaks included, if
         include_breaks is true. See Line Breaks for a specification
         of how line breaking is done.


Code Base:
----------

We should use Fredrik Lundh's Unicode object implementation as basis.
It already implements most of the string methods needed and provides a
well written code base which we can build upon.

The object sharing implemented in Fredrik's implementation should be
dropped.


Test Cases:
-----------

Test cases should follow those in Lib/test/test_string.py and include
additional checks for the Codec Registry and the Standard Codecs.


References:
-----------

Unicode Consortium:
  http://www.unicode.org/

Unicode FAQ:
  http://www.unicode.org/unicode/faq/

Unicode 3.0:
  http://www.unicode.org/unicode/standard/versions/Unicode3.0.html

Unicode-TechReports:
  http://www.unicode.org/unicode/reports/techreports.html

Unicode-Mappings:
  ftp://ftp.unicode.org/Public/MAPPINGS/

Introduction to Unicode (a little outdated but still nice to read):
  http://www.nada.kth.se/i18n/ucs/unicode-iso10646-oview.html

Encodings:

  Overview:
    http://czyborra.com/utf/

  UCS-2:
    http://www.uazone.com/multiling/unicode/ucs2.html

  UTF-7:
    Defined in RFC 2152, e.g.
    http://www.uazone.com/multiling/ml-docs/rfc2152.txt

  UTF-8:
    Defined in RFC 2279, e.g.
    http://info.internet.isi.edu/in-notes/rfc/files/rfc2279.txt

  UTF-16:
    http://www.uazone.com/multiling/unicode/wg2n1035.html


History of this Proposal:
-------------------------
1.2:
1.1: Added note about comparisons and hash values. Added note about
     case mapping algorithms. Changed stream codecs .read() and
     .write() method to match the standard file-like object methods
     (bytes consumed information is no longer returned by the methods)
1.0: changed encode Codec method to be symmetric to the decode method
     (they both return (object, data consumed) now and thus become
     interchangeable); removed __init__ method of Codec class (the
     methods are stateless) and moved the errors argument down to the
     methods; made the Codec design more generic with respect to the
     type of input and output objects; changed StreamWriter.flush to
     StreamWriter.reset in order to avoid overriding the stream's
     .flush() method; renamed .breaklines() to .splitlines(); renamed
     the module unicodec to codecs; modified the File I/O section to
     refer to the stream codecs.
0.9: changed errors keyword argument definition; added 'replace' error
     handling; changed the codec APIs to accept buffer like objects on
     input; some minor typo fixes; added Whitespace section and
     included references for Unicode characters that have the
     whitespace and the line break characteristic; added note that
     search functions can expect lower-case encoding names; dropped
     slicing and offsets in the codec APIs
0.8: added encodings package and raw unicode escape encoding;
     untabified the proposal; added notes on Unicode format strings;
     added .breaklines() method
0.7: added a whole new set of codec APIs; added a different encoder
     lookup scheme; fixed some names
0.6: changed "s#" to "t#"; changed <defencbuf> to <defencstr> holding
     a real Python string object; changed Buffer Interface to delegate
     requests to <defencstr>'s buffer interface; removed the explicit
     reference to the unicodec.codecs dictionary (the module can
     implement this in any way fit for the purpose); removed the
     settable default encoding; moved UnicodeError from unicodec to
     exceptions; "s#" now returns the internal data; passed the
     UCS-2/UTF-16 checking from the Unicode constructor to the Codecs
0.5: moved sys.bom to unicodec.BOM; added sections on case mapping,
     private use encodings and Unicode character properties
0.4: added Codec interface, notes on %-formatting, changed some
     encoding details, added comments on stream wrappers, fixed some
     discussion points (most important: Internal Format), clarified
     the 'unicode-escape' encoding, added encoding references
0.3: added references, comments on codec modules, the internal format,
     bf_getcharbuffer and the RE engine; added 'unicode-escape'
     encoding proposed by Tim Peters and fixed repr(u) accordingly
0.2: integrated Guido's suggestions, added stream codecs and file
     wrapping
0.1: first version


-----------------------------------------------------------------------------
Written by Marc-Andre Lemburg, 1999-2000, mal@lemburg.com
-----------------------------------------------------------------------------