-rw-r--r-- | Misc/unicode.txt | 1116 |
1 files changed, 3 insertions, 1113 deletions
diff --git a/Misc/unicode.txt b/Misc/unicode.txt index b71e4ca..a252ebe 100644 --- a/Misc/unicode.txt +++ b/Misc/unicode.txt @@ -1,1115 +1,5 @@ -============================================================================= - Python Unicode Integration Proposal Version: 1.7 ------------------------------------------------------------------------------ +This document has been PEP-ified. Please see PEP 100 at: + http://www.python.org/peps/pep-0100.html -Introduction: -------------- - -The idea of this proposal is to add native Unicode 3.0 support to -Python in a way that makes use of Unicode strings as simple as -possible without introducing too many pitfalls along the way. - -Since this goal is not easy to achieve -- strings being one of the -most fundamental objects in Python --, we expect this proposal to -undergo some significant refinements. - -Note that the current version of this proposal is still a bit unsorted -due to the many different aspects of the Unicode-Python integration. - -The latest version of this document is always available at: - - http://starship.python.net/~lemburg/unicode-proposal.txt - -Older versions are available as: - - http://starship.python.net/~lemburg/unicode-proposal-X.X.txt - - -Conventions: ------------- - -· In examples we use u = Unicode object and s = Python string - -· 'XXX' markings indicate points of discussion (PODs) - - -General Remarks: ----------------- - -· Unicode encoding names should be lower case on output and - case-insensitive on input (they will be converted to lower case - by all APIs taking an encoding name as input). - -· Encoding names should follow the name conventions as used by the - Unicode Consortium: spaces are converted to hyphens, e.g. 'utf 16' is - written as 'utf-16'. - -· Codec modules should use the same names, but with hyphens converted - to underscores, e.g. utf_8, utf_16, iso_8859_1. 
- - -Unicode Default Encoding: -------------------------- - -The Unicode implementation has to make some assumption about the -encoding of 8-bit strings passed to it for coercion and about the -encoding to as default for conversion of Unicode to strings when no -specific encoding is given. This encoding is called <default encoding> -throughout this text. - -For this, the implementation maintains a global which can be set in -the site.py Python startup script. Subsequent changes are not -possible. The <default encoding> can be set and queried using the -two sys module APIs: - - sys.setdefaultencoding(encoding) - --> Sets the <default encoding> used by the Unicode implementation. - encoding has to be an encoding which is supported by the Python - installation, otherwise, a LookupError is raised. - - Note: This API is only available in site.py ! It is removed - from the sys module by site.py after usage. - - sys.getdefaultencoding() - --> Returns the current <default encoding>. - -If not otherwise defined or set, the <default encoding> defaults to -'ascii'. This encoding is also the startup default of Python (and in -effect before site.py is executed). - -Note that the default site.py startup module contains disabled -optional code which can set the <default encoding> according to the -encoding defined by the current locale. The locale module is used to -extract the encoding from the locale default settings defined by the -OS environment (see locale.py). If the encoding cannot be determined, -is unkown or unsupported, the code defaults to setting the <default -encoding> to 'ascii'. To enable this code, edit the site.py file or -place the appropriate code into the sitecustomize.py module of your -Python installation. 
- - -Unicode Constructors: ---------------------- - -Python should provide a built-in constructor for Unicode strings which -is available through __builtins__: - - u = unicode(encoded_string[,encoding=<default encoding>][,errors="strict"]) - - u = u'<unicode-escape encoded Python string>' - - u = ur'<raw-unicode-escape encoded Python string>' - -With the 'unicode-escape' encoding being defined as: - -· all non-escape characters represent themselves as Unicode ordinal - (e.g. 'a' -> U+0061). - -· all existing defined Python escape sequences are interpreted as - Unicode ordinals; note that \xXXXX can represent all Unicode - ordinals, and \OOO (octal) can represent Unicode ordinals up to U+01FF. - -· a new escape sequence, \uXXXX, represents U+XXXX; it is a syntax - error to have fewer than 4 digits after \u. - -For an explanation of possible values for errors see the Codec section -below. - -Examples: - -u'abc' -> U+0061 U+0062 U+0063 -u'\u1234' -> U+1234 -u'abc\u1234\n' -> U+0061 U+0062 U+0063 U+1234 U+005c - -The 'raw-unicode-escape' encoding is defined as follows: - -· \uXXXX sequence represent the U+XXXX Unicode character if and - only if the number of leading backslashes is odd - -· all other characters represent themselves as Unicode ordinal - (e.g. 'b' -> U+0062) - - -Note that you should provide some hint to the encoding you used to -write your programs as pragma line in one the first few comment lines -of the source file (e.g. '# source file encoding: latin-1'). If you -only use 7-bit ASCII then everything is fine and no such notice is -needed, but if you include Latin-1 characters not defined in ASCII, it -may well be worthwhile including a hint since people in other -countries will want to be able to read your source strings too. - - -Unicode Type Object: --------------------- - -Unicode objects should have the type UnicodeType with type name -'unicode', made available through the standard types module. 
- - -Unicode Output: ---------------- - -Unicode objects have a method .encode([encoding=<default encoding>]) -which returns a Python string encoding the Unicode string using the -given scheme (see Codecs). - - print u := print u.encode() # using the <default encoding> - - str(u) := u.encode() # using the <default encoding> - - repr(u) := "u%s" % repr(u.encode('unicode-escape')) - -Also see Internal Argument Parsing and Buffer Interface for details on -how other APIs written in C will treat Unicode objects. - - -Unicode Ordinals: ------------------ - -Since Unicode 3.0 has a 32-bit ordinal character set, the implementation -should provide 32-bit aware ordinal conversion APIs: - - ord(u[:1]) (this is the standard ord() extended to work with Unicode - objects) - --> Unicode ordinal number (32-bit) - - unichr(i) - --> Unicode object for character i (provided it is 32-bit); - ValueError otherwise - -Both APIs should go into __builtins__ just like their string -counterparts ord() and chr(). - -Note that Unicode provides space for private encodings. Usage of these -can cause different output representations on different machines. This -problem is not a Python or Unicode problem, but a machine setup and -maintenance one. - - -Comparison & Hash Value: ------------------------- - -Unicode objects should compare equal to other objects after these -other objects have been coerced to Unicode. For strings this means -that they are interpreted as Unicode string using the <default -encoding>. - -Unicode objects should return the same hash value as their ASCII -equivalent strings. Unicode strings holding non-ASCII values are not -guaranteed to return the same hash values as the default encoded -equivalent string representation. - -When compared using cmp() (or PyObject_Compare()) the implementation -should mask TypeErrors raised during the conversion to remain in synch -with the string behavior. 
All other errors such as ValueErrors raised -during coercion of strings to Unicode should not be masked and passed -through to the user. - -In containment tests ('a' in u'abc' and u'a' in 'abc') both sides -should be coerced to Unicode before applying the test. Errors occurring -during coercion (e.g. None in u'abc') should not be masked. - - -Coercion: ---------- - -Using Python strings and Unicode objects to form new objects should -always coerce to the more precise format, i.e. Unicode objects. - - u + s := u + unicode(s) - - s + u := unicode(s) + u - -All string methods should delegate the call to an equivalent Unicode -object method call by converting all involved strings to Unicode and -then applying the arguments to the Unicode method of the same name, -e.g. - - string.join((s,u),sep) := (s + sep) + u - - sep.join((s,u)) := (s + sep) + u - -For a discussion of %-formatting w/r to Unicode objects, see -Formatting Markers. - - -Exceptions: ------------ - -UnicodeError is defined in the exceptions module as a subclass of -ValueError. It is available at the C level via PyExc_UnicodeError. -All exceptions related to Unicode encoding/decoding should be -subclasses of UnicodeError. - - -Codecs (Coder/Decoders) Lookup: -------------------------------- - -A Codec (see Codec Interface Definition) search registry should be -implemented by a module "codecs": - - codecs.register(search_function) - -Search functions are expected to take one argument, the encoding name -in all lower case letters and with hyphens and spaces converted to -underscores, and return a tuple of functions (encoder, decoder, -stream_reader, stream_writer) taking the following arguments: - - encoder and decoder: - These must be functions or methods which have the same - interface as the .encode/.decode methods of Codec instances - (see Codec Interface). The functions/methods are expected to - work in a stateless mode. 
- - stream_reader and stream_writer: - These need to be factory functions with the following - interface: - - factory(stream,errors='strict') - - The factory functions must return objects providing - the interfaces defined by StreamWriter/StreamReader resp. - (see Codec Interface). Stream codecs can maintain state. - - Possible values for errors are defined in the Codec - section below. - -In case a search function cannot find a given encoding, it should -return None. - -Aliasing support for encodings is left to the search functions -to implement. - -The codecs module will maintain an encoding cache for performance -reasons. Encodings are first looked up in the cache. If not found, the -list of registered search functions is scanned. If no codecs tuple is -found, a LookupError is raised. Otherwise, the codecs tuple is stored -in the cache and returned to the caller. - -To query the Codec instance the following API should be used: - - codecs.lookup(encoding) - -This will either return the found codecs tuple or raise a LookupError. - - -Standard Codecs: ----------------- - -Standard codecs should live inside an encodings/ package directory in the -Standard Python Code Library. The __init__.py file of that directory should -include a Codec Lookup compatible search function implementing a lazy module -based codec lookup. - -Python should provide a few standard codecs for the most relevant -encodings, e.g. - - 'utf-8': 8-bit variable length encoding - 'utf-16': 16-bit variable length encoding (little/big endian) - 'utf-16-le': utf-16 but explicitly little endian - 'utf-16-be': utf-16 but explicitly big endian - 'ascii': 7-bit ASCII codepage - 'iso-8859-1': ISO 8859-1 (Latin 1) codepage - 'unicode-escape': See Unicode Constructors for a definition - 'raw-unicode-escape': See Unicode Constructors for a definition - 'native': Dump of the Internal Format used by Python - -Common aliases should also be provided per default, e.g. 'latin-1' -for 'iso-8859-1'. 
- -Note: 'utf-16' should be implemented by using and requiring byte order -marks (BOM) for file input/output. - -All other encodings such as the CJK ones to support Asian scripts -should be implemented in separate packages which do not get included -in the core Python distribution and are not a part of this proposal. - - -Codecs Interface Definition: ----------------------------- - -The following base class should be defined in the module -"codecs". They provide not only templates for use by encoding module -implementors, but also define the interface which is expected by the -Unicode implementation. - -Note that the Codec Interface defined here is well suitable for a -larger range of applications. The Unicode implementation expects -Unicode objects on input for .encode() and .write() and character -buffer compatible objects on input for .decode(). Output of .encode() -and .read() should be a Python string and .decode() must return an -Unicode object. - -First, we have the stateless encoders/decoders. These do not work in -chunks as the stream codecs (see below) do, because all components are -expected to be available in memory. - -class Codec: - - """ Defines the interface for stateless encoders/decoders. - - The .encode()/.decode() methods may implement different error - handling schemes by providing the errors argument. These - string values are defined: - - 'strict' - raise an error (or a subclass) - 'ignore' - ignore the character and continue with the next - 'replace' - replace with a suitable replacement character; - Python will use the official U+FFFD REPLACEMENT - CHARACTER for the builtin Unicode codecs. - - """ - def encode(self,input,errors='strict'): - - """ Encodes the object input and returns a tuple (output - object, length consumed). - - errors defines the error handling to apply. It defaults to - 'strict' handling. - - The method may not store state in the Codec instance. 
Use - StreamCodec for codecs which have to keep state in order to - make encoding/decoding efficient. - - """ - ... - - def decode(self,input,errors='strict'): - - """ Decodes the object input and returns a tuple (output - object, length consumed). - - input must be an object which provides the bf_getreadbuf - buffer slot. Python strings, buffer objects and memory - mapped files are examples of objects providing this slot. - - errors defines the error handling to apply. It defaults to - 'strict' handling. - - The method may not store state in the Codec instance. Use - StreamCodec for codecs which have to keep state in order to - make encoding/decoding efficient. - - """ - ... - -StreamWriter and StreamReader define the interface for stateful -encoders/decoders which work on streams. These allow processing of the -data in chunks to efficiently use memory. If you have large strings in -memory, you may want to wrap them with cStringIO objects and then use -these codecs on them to be able to do chunk processing as well, -e.g. to provide progress information to the user. - -class StreamWriter(Codec): - - def __init__(self,stream,errors='strict'): - - """ Creates a StreamWriter instance. - - stream must be a file-like object open for writing - (binary) data. - - The StreamWriter may implement different error handling - schemes by providing the errors keyword argument. These - parameters are defined: - - 'strict' - raise a ValueError (or a subclass) - 'ignore' - ignore the character and continue with the next - 'replace'- replace with a suitable replacement character - - """ - self.stream = stream - self.errors = errors - - def write(self,object): - - """ Writes the object's contents encoded to self.stream. - """ - data, consumed = self.encode(object,self.errors) - self.stream.write(data) - - def writelines(self, list): - - """ Writes the concatenated list of strings to the stream - using .write(). 
- """ - self.write(''.join(list)) - - def reset(self): - - """ Flushes and resets the codec buffers used for keeping state. - - Calling this method should ensure that the data on the - output is put into a clean state, that allows appending - of new fresh data without having to rescan the whole - stream to recover state. - - """ - pass - - def __getattr__(self,name, - - getattr=getattr): - - """ Inherit all other methods from the underlying stream. - """ - return getattr(self.stream,name) - -class StreamReader(Codec): - - def __init__(self,stream,errors='strict'): - - """ Creates a StreamReader instance. - - stream must be a file-like object open for reading - (binary) data. - - The StreamReader may implement different error handling - schemes by providing the errors keyword argument. These - parameters are defined: - - 'strict' - raise a ValueError (or a subclass) - 'ignore' - ignore the character and continue with the next - 'replace'- replace with a suitable replacement character; - - """ - self.stream = stream - self.errors = errors - - def read(self,size=-1): - - """ Decodes data from the stream self.stream and returns the - resulting object. - - size indicates the approximate maximum number of bytes to - read from the stream for decoding purposes. The decoder - can modify this setting as appropriate. The default value - -1 indicates to read and decode as much as possible. size - is intended to prevent having to decode huge files in one - step. - - The method should use a greedy read strategy meaning that - it should read as much data as is allowed within the - definition of the encoding and the given size, e.g. if - optional encoding endings or state markers are available - on the stream, these should be read too. 
- - """ - # Unsliced reading: - if size < 0: - return self.decode(self.stream.read())[0] - - # Sliced reading: - read = self.stream.read - decode = self.decode - data = read(size) - i = 0 - while 1: - try: - object, decodedbytes = decode(data) - except ValueError,why: - # This method is slow but should work under pretty much - # all conditions; at most 10 tries are made - i = i + 1 - newdata = read(1) - if not newdata or i > 10: - raise - data = data + newdata - else: - return object - - def readline(self, size=None): - - """ Read one line from the input stream and return the - decoded data. - - Note: Unlike the .readlines() method, this method inherits - the line breaking knowledge from the underlying stream's - .readline() method -- there is currently no support for - line breaking using the codec decoder due to lack of line - buffering. Subclasses should however, if possible, try to - implement this method using their own knowledge of line - breaking. - - size, if given, is passed as size argument to the stream's - .readline() method. - - """ - if size is None: - line = self.stream.readline() - else: - line = self.stream.readline(size) - return self.decode(line)[0] - - def readlines(self, sizehint=0): - - """ Read all lines available on the input stream - and return them as list of lines. - - Line breaks are implemented using the codec's decoder - method and are included in the list entries. - - sizehint, if given, is passed as size argument to the - stream's .read() method. - - """ - if sizehint is None: - data = self.stream.read() - else: - data = self.stream.read(sizehint) - return self.decode(data)[0].splitlines(1) - - def reset(self): - - """ Resets the codec buffers used for keeping state. - - Note that no stream repositioning should take place. - This method is primarily intended to be able to recover - from decoding errors. - - """ - pass - - def __getattr__(self,name, - - getattr=getattr): - - """ Inherit all other methods from the underlying stream. 
- """ - return getattr(self.stream,name) - - -Stream codec implementors are free to combine the StreamWriter and -StreamReader interfaces into one class. Even combining all these with -the Codec class should be possible. - -Implementors are free to add additional methods to enhance the codec -functionality or provide extra state information needed for them to -work. The internal codec implementation will only use the above -interfaces, though. - -It is not required by the Unicode implementation to use these base -classes, only the interfaces must match; this allows writing Codecs as -extension types. - -As guideline, large mapping tables should be implemented using static -C data in separate (shared) extension modules. That way multiple -processes can share the same data. - -A tool to auto-convert Unicode mapping files to mapping modules should be -provided to simplify support for additional mappings (see References). - - -Whitespace: ------------ - -The .split() method will have to know about what is considered -whitespace in Unicode. - - -Case Conversion: ----------------- - -Case conversion is rather complicated with Unicode data, since there -are many different conditions to respect. See - - http://www.unicode.org/unicode/reports/tr13/ - -for some guidelines on implementing case conversion. - -For Python, we should only implement the 1-1 conversions included in -Unicode. Locale dependent and other special case conversions (see the -Unicode standard file SpecialCasing.txt) should be left to user land -routines and not go into the core interpreter. - -The methods .capitalize() and .iscapitalized() should follow the case -mapping algorithm defined in the above technical report as closely as -possible. - - -Line Breaks: ------------- - -Line breaking should be done for all Unicode characters having the B -property as well as the combinations CRLF, CR, LF (interpreted in that -order) and other special line separators defined by the standard. 
- -The Unicode type should provide a .splitlines() method which returns a -list of lines according to the above specification. See Unicode -Methods. - - -Unicode Character Properties: ------------------------------ - -A separate module "unicodedata" should provide a compact interface to -all Unicode character properties defined in the standard's -UnicodeData.txt file. - -Among other things, these properties provide ways to recognize -numbers, digits, spaces, whitespace, etc. - -Since this module will have to provide access to all Unicode -characters, it will eventually have to contain the data from -UnicodeData.txt which takes up around 600kB. For this reason, the data -should be stored in static C data. This enables compilation as shared -module which the underlying OS can shared between processes (unlike -normal Python code modules). - -There should be a standard Python interface for accessing this information -so that other implementors can plug in their own possibly enhanced versions, -e.g. ones that do decompressing of the data on-the-fly. - - -Private Code Point Areas: -------------------------- - -Support for these is left to user land Codecs and not explicitly -integrated into the core. Note that due to the Internal Format being -implemented, only the area between \uE000 and \uF8FF is usable for -private encodings. - - -Internal Format: ----------------- - -The internal format for Unicode objects should use a Python specific -fixed format <PythonUnicode> implemented as 'unsigned short' (or -another unsigned numeric type having 16 bits). Byte order is platform -dependent. - -This format will hold UTF-16 encodings of the corresponding Unicode -ordinals. The Python Unicode implementation will address these values -as if they were UCS-2 values. UCS-2 and UTF-16 are the same for all -currently defined Unicode character points. 
UTF-16 without surrogates -provides access to about 64k characters and covers all characters in -the Basic Multilingual Plane (BMP) of Unicode. - -It is the Codec's responsibility to ensure that the data they pass to -the Unicode object constructor respects this assumption. The -constructor does not check the data for Unicode compliance or use of -surrogates. - -Future implementations can extend the 32 bit restriction to the full -set of all UTF-16 addressable characters (around 1M characters). - -The Unicode API should provide interface routines from <PythonUnicode> -to the compiler's wchar_t which can be 16 or 32 bit depending on the -compiler/libc/platform being used. - -Unicode objects should have a pointer to a cached Python string object -<defenc> holding the object's value using the <default encoding>. -This is needed for performance and internal parsing (see Internal -Argument Parsing) reasons. The buffer is filled when the first -conversion request to the <default encoding> is issued on the object. - -Interning is not needed (for now), since Python identifiers are -defined as being ASCII only. - -codecs.BOM should return the byte order mark (BOM) for the format -used internally. The codecs module should provide the following -additional constants for convenience and reference (codecs.BOM will -either be BOM_BE or BOM_LE depending on the platform): - - BOM_BE: '\376\377' - (corresponds to Unicode U+0000FEFF in UTF-16 on big endian - platforms == ZERO WIDTH NO-BREAK SPACE) - - BOM_LE: '\377\376' - (corresponds to Unicode U+0000FFFE in UTF-16 on little endian - platforms == defined as being an illegal Unicode character) - - BOM4_BE: '\000\000\376\377' - (corresponds to Unicode U+0000FEFF in UCS-4) - - BOM4_LE: '\377\376\000\000' - (corresponds to Unicode U+0000FFFE in UCS-4) - -Note that Unicode sees big endian byte order as being "correct". The -swapped order is taken to be an indicator for a "wrong" format, hence -the illegal character definition. 
- -The configure script should provide aid in deciding whether Python can -use the native wchar_t type or not (it has to be a 16-bit unsigned -type). - - -Buffer Interface: ------------------ - -Implement the buffer interface using the <defenc> Python string object -as basis for bf_getcharbuf and the internal buffer for -bf_getreadbuf. If bf_getcharbuf is requested and the <defenc> object -does not yet exist, it is created first. - -Note that as special case, the parser marker "s#" will not return raw -Unicode UTF-16 data (which the bf_getreadbuf returns), but instead -tries to encode the Unicode object using the default encoding and then -returns a pointer to the resulting string object (or raises an -exception in case the conversion fails). This was done in order to -prevent accidentely writing binary data to an output stream which the -other end might not recognize. - -This has the advantage of being able to write to output streams (which -typically use this interface) without additional specification of the -encoding to use. - -If you need to access the read buffer interface of Unicode objects, -use the PyObject_AsReadBuffer() interface. - -The internal format can also be accessed using the 'unicode-internal' -codec, e.g. via u.encode('unicode-internal'). - - -Pickle/Marshalling: -------------------- - -Should have native Unicode object support. The objects should be -encoded using platform independent encodings. - -Marshal should use UTF-8 and Pickle should either choose -Raw-Unicode-Escape (in text mode) or UTF-8 (in binary mode) as -encoding. Using UTF-8 instead of UTF-16 has the advantage of -eliminating the need to store a BOM mark. - - -Regular Expressions: --------------------- - -Secret Labs AB is working on a Unicode-aware regular expression -machinery. It works on plain 8-bit, UCS-2, and (optionally) UCS-4 -internal character buffers. - -Also see - - http://www.unicode.org/unicode/reports/tr18/ - -for some remarks on how to treat Unicode REs. 
- - -Formatting Markers: -------------------- - -Format markers are used in Python format strings. If Python strings -are used as format strings, the following interpretations should be in -effect: - - '%s': For Unicode objects this will cause coercion of the - whole format string to Unicode. Note that - you should use a Unicode format string to start - with for performance reasons. - -In case the format string is an Unicode object, all parameters are coerced -to Unicode first and then put together and formatted according to the format -string. Numbers are first converted to strings and then to Unicode. - - '%s': Python strings are interpreted as Unicode - string using the <default encoding>. Unicode - objects are taken as is. - -All other string formatters should work accordingly. - -Example: - -u"%s %s" % (u"abc", "abc") == u"abc abc" - - -Internal Argument Parsing: --------------------------- - -These markers are used by the PyArg_ParseTuple() APIs: - - "U": Check for Unicode object and return a pointer to it - - "s": For Unicode objects: return a pointer to the object's - <defenc> buffer (which uses the <default encoding>). - - "s#": Access to the default encoded version of the Unicode object - (see Buffer Interface); note that the length relates to the length - of the default encoded string rather than the Unicode object length. - - "t#": Same as "s#". - - "es": - Takes two parameters: encoding (const char *) and - buffer (char **). - - The input object is first coerced to Unicode in the usual way - and then encoded into a string using the given encoding. - - On output, a buffer of the needed size is allocated and - returned through *buffer as NULL-terminated string. - The encoded may not contain embedded NULL characters. - The caller is responsible for calling PyMem_Free() - to free the allocated *buffer after usage. - - "es#": - Takes three parameters: encoding (const char *), - buffer (char **) and buffer_len (int *). 
-
-    The input object is first coerced to Unicode in the usual way
-    and then encoded into a string using the given encoding.
-
-    If *buffer is non-NULL, *buffer_len must be set to sizeof(buffer)
-    on input. Output is then copied to *buffer.
-
-    If *buffer is NULL, a buffer of the needed size is
-    allocated and output copied into it. *buffer is then
-    updated to point to the allocated memory area.
-    The caller is responsible for calling PyMem_Free()
-    to free the allocated *buffer after usage.
-
-    In both cases *buffer_len is updated to the number of
-    characters written (excluding the trailing NULL-byte).
-    The output buffer is assured to be NULL-terminated.
-
-Examples:
-
-Using "es#" with auto-allocation:
-
-    static PyObject *
-    test_parser(PyObject *self,
-                PyObject *args)
-    {
-        PyObject *str;
-        const char *encoding = "latin-1";
-        char *buffer = NULL;
-        int buffer_len = 0;
-
-        if (!PyArg_ParseTuple(args, "es#:test_parser",
-                              encoding, &buffer, &buffer_len))
-            return NULL;
-        if (!buffer) {
-            PyErr_SetString(PyExc_SystemError,
-                            "buffer is NULL");
-            return NULL;
-        }
-        str = PyString_FromStringAndSize(buffer, buffer_len);
-        PyMem_Free(buffer);
-        return str;
-    }
-
-Using "es" with auto-allocation returning a NULL-terminated string:
-
-    static PyObject *
-    test_parser(PyObject *self,
-                PyObject *args)
-    {
-        PyObject *str;
-        const char *encoding = "latin-1";
-        char *buffer = NULL;
-
-        if (!PyArg_ParseTuple(args, "es:test_parser",
-                              encoding, &buffer))
-            return NULL;
-        if (!buffer) {
-            PyErr_SetString(PyExc_SystemError,
-                            "buffer is NULL");
-            return NULL;
-        }
-        str = PyString_FromString(buffer);
-        PyMem_Free(buffer);
-        return str;
-    }
-
-Using "es#" with a pre-allocated buffer:
-
-    static PyObject *
-    test_parser(PyObject *self,
-                PyObject *args)
-    {
-        PyObject *str;
-        const char *encoding = "latin-1";
-        char _buffer[10];
-        char *buffer = _buffer;
-        int buffer_len = sizeof(_buffer);
-
-        if (!PyArg_ParseTuple(args, "es#:test_parser",
-                              encoding, &buffer, &buffer_len))
-            return NULL;
-        if (!buffer) {
-            PyErr_SetString(PyExc_SystemError,
-                            "buffer is NULL");
-            return NULL;
-        }
-        str = PyString_FromStringAndSize(buffer, buffer_len);
-        return str;
-    }
-
-
-File/Stream Output:
--------------------
-
-Since file.write(object) and most other stream writers use the "s#" or
-"t#" argument parsing marker for querying the data to write, the
-default encoded string version of the Unicode object will be written
-to the streams (see Buffer Interface).
-
-For explicit handling of files using Unicode, the standard stream
-codecs as available through the codecs module should be used.
-
-The codecs module should provide a short-cut open(filename,mode,encoding)
-available which also assures that mode contains the 'b' character when
-needed.
-
-
-File/Stream Input:
-------------------
-
-Only the user knows what encoding the input data uses, so no special
-magic is applied. The user will have to explicitly convert the string
-data to Unicode objects as needed or use the file wrappers defined in
-the codecs module (see File/Stream Output).
-
-
-Unicode Methods & Attributes:
------------------------------
-
-All Python string methods, plus:
-
-    .encode([encoding=<default encoding>][,errors="strict"])
-       --> see Unicode Output
-
-    .splitlines([include_breaks=0])
-       --> breaks the Unicode string into a list of (Unicode) lines;
-           returns the lines with line breaks included, if include_breaks
-           is true. See Line Breaks for a specification of how line breaking
-           is done.
-
-
-Code Base:
-----------
-
-We should use Fredrik Lundh's Unicode object implementation as basis.
-It already implements most of the string methods needed and provides a
-well written code base which we can build upon.
-
-The object sharing implemented in Fredrik's implementation should
-be dropped.
-
-
-Test Cases:
------------
-
-Test cases should follow those in Lib/test/test_string.py and include
-additional checks for the Codec Registry and the Standard Codecs.
-
-
-References:
------------
-
-Unicode Consortium:
-        http://www.unicode.org/
-
-Unicode FAQ:
-        http://www.unicode.org/unicode/faq/
-
-Unicode 3.0:
-        http://www.unicode.org/unicode/standard/versions/Unicode3.0.html
-
-Unicode-TechReports:
-        http://www.unicode.org/unicode/reports/techreports.html
-
-Unicode-Mappings:
-        ftp://ftp.unicode.org/Public/MAPPINGS/
-
-Introduction to Unicode (a little outdated by still nice to read):
-        http://www.nada.kth.se/i18n/ucs/unicode-iso10646-oview.html
-
-For comparison:
-        Introducing Unicode to ECMAScript (aka JavaScript) --
-        http://www-4.ibm.com/software/developer/library/internationalization-support.html
-
-IANA Character Set Names:
-        ftp://ftp.isi.edu/in-notes/iana/assignments/character-sets
-
-Discussion of UTF-8 and Unicode support for POSIX and Linux:
-        http://www.cl.cam.ac.uk/~mgk25/unicode.html
-
-Encodings:
-
-    Overview:
-        http://czyborra.com/utf/
-
-    UTC-2:
-        http://www.uazone.com/multiling/unicode/ucs2.html
-
-    UTF-7:
-        Defined in RFC2152, e.g.
-        http://www.uazone.com/multiling/ml-docs/rfc2152.txt
-
-    UTF-8:
-        Defined in RFC2279, e.g.
-        http://info.internet.isi.edu/in-notes/rfc/files/rfc2279.txt
-
-    UTF-16:
-        http://www.uazone.com/multiling/unicode/wg2n1035.html
-
-
-History of this Proposal:
--------------------------
-1.7: Added note about the changed behaviour of "s#".
-1.6: Changed <defencstr> to <defenc> since this is the name used in the
-     implementation. Added notes about the usage of <defenc> in the
-     buffer protocol implementation.
-1.5: Added notes about setting the <default encoding>. Fixed some
-     typos (thanks to Andrew Kuchling). Changed <defencstr> to <utf8str>.
-1.4: Added note about mixed type comparisons and contains tests.
-     Changed treating of Unicode objects in format strings (if used
-     with '%s' % u they will now cause the format string to be
-     coerced to Unicode, thus producing a Unicode object on return).
-     Added link to IANA charset names (thanks to Lars Marius Garshol).
-     Added new codec methods .readline(), .readlines() and .writelines().
-1.3: Added new "es" and "es#" parser markers
-1.2: Removed POD about codecs.open()
-1.1: Added note about comparisons and hash values. Added note about
-     case mapping algorithms. Changed stream codecs .read() and
-     .write() method to match the standard file-like object methods
-     (bytes consumed information is no longer returned by the methods)
-1.0: changed encode Codec method to be symmetric to the decode method
-     (they both return (object, data consumed) now and thus become
-     interchangeable); removed __init__ method of Codec class (the
-     methods are stateless) and moved the errors argument down to the
-     methods; made the Codec design more generic w/r to type of input
-     and output objects; changed StreamWriter.flush to StreamWriter.reset
-     in order to avoid overriding the stream's .flush() method;
-     renamed .breaklines() to .splitlines(); renamed the module unicodec
-     to codecs; modified the File I/O section to refer to the stream codecs.
-0.9: changed errors keyword argument definition; added 'replace' error
-     handling; changed the codec APIs to accept buffer like objects on
-     input; some minor typo fixes; added Whitespace section and
-     included references for Unicode characters that have the whitespace
-     and the line break characteristic; added note that search functions
-     can expect lower-case encoding names; dropped slicing and offsets
-     in the codec APIs
-0.8: added encodings package and raw unicode escape encoding; untabified
-     the proposal; added notes on Unicode format strings; added
-     .breaklines() method
-0.7: added a whole new set of codec APIs; added a different encoder
-     lookup scheme; fixed some names
-0.6: changed "s#" to "t#"; changed <defencbuf> to <defencstr> holding
-     a real Python string object; changed Buffer Interface to delegate
-     requests to <defencstr>'s buffer interface; removed the explicit
-     reference to the unicodec.codecs dictionary (the module can implement
-     this in way fit for the purpose); removed the settable default
-     encoding; move UnicodeError from unicodec to exceptions; "s#"
-     not returns the internal data; passed the UCS-2/UTF-16 checking
-     from the Unicode constructor to the Codecs
-0.5: moved sys.bom to unicodec.BOM; added sections on case mapping,
-     private use encodings and Unicode character properties
-0.4: added Codec interface, notes on %-formatting, changed some encoding
-     details, added comments on stream wrappers, fixed some discussion
-     points (most important: Internal Format), clarified the
-     'unicode-escape' encoding, added encoding references
-0.3: added references, comments on codec modules, the internal format,
-     bf_getcharbuffer and the RE engine; added 'unicode-escape' encoding
-     proposed by Tim Peters and fixed repr(u) accordingly
-0.2: integrated Guido's suggestions, added stream codecs and file
-     wrapping
-0.1: first version
-
-
------------------------------------------------------------------------------
-Written by Marc-Andre Lemburg, 1999-2000, mal@lemburg.com
------------------------------------------------------------------------------
+-Barry