Diffstat (limited to 'Doc/c-api/unicode.rst')
-rw-r--r-- | Doc/c-api/unicode.rst | 887 |
1 files changed, 712 insertions, 175 deletions
diff --git a/Doc/c-api/unicode.rst b/Doc/c-api/unicode.rst index d1b57d9..89bd148 100644 --- a/Doc/c-api/unicode.rst +++ b/Doc/c-api/unicode.rst @@ -6,38 +6,72 @@ Unicode Objects and Codecs -------------------------- .. sectionauthor:: Marc-André Lemburg <mal@lemburg.com> +.. sectionauthor:: Georg Brandl <georg@python.org> Unicode Objects ^^^^^^^^^^^^^^^ +Since the implementation of :pep:`393` in Python 3.3, Unicode objects internally +use a variety of representations, in order to allow handling the complete range +of Unicode characters while staying memory efficient. There are special cases +for strings where all code points are below 128, 256, or 65536; otherwise, code +points must be below 1114112 (which is the full Unicode range). + +:c:type:`Py_UNICODE*` and UTF-8 representations are created on demand and cached +in the Unicode object. The :c:type:`Py_UNICODE*` representation is deprecated +and inefficient; it should be avoided in performance- or memory-sensitive +situations. + +Due to the transition between the old APIs and the new APIs, unicode objects +can internally be in two states depending on how they were created: + +* "canonical" unicode objects are all objects created by a non-deprecated + unicode API. They use the most efficient representation allowed by the + implementation. + +* "legacy" unicode objects have been created through one of the deprecated + APIs (typically :c:func:`PyUnicode_FromUnicode`) and only bear the + :c:type:`Py_UNICODE*` representation; you will have to call + :c:func:`PyUnicode_READY` on them before calling any other API. + + Unicode Type """""""""""" These are the basic Unicode object types used for the Unicode implementation in Python: +.. c:type:: Py_UCS4 + Py_UCS2 + Py_UCS1 + + These types are typedefs for unsigned integer types wide enough to contain + characters of 32 bits, 16 bits and 8 bits, respectively. When dealing with + single Unicode characters, use :c:type:`Py_UCS4`. + + .. versionadded:: 3.3 + .. 
c:type:: Py_UNICODE - This type represents the storage type which is used by Python internally as - basis for holding Unicode ordinals. Python's default builds use a 16-bit type - for :c:type:`Py_UNICODE` and store Unicode values internally as UCS2. It is also - possible to build a UCS4 version of Python (most recent Linux distributions come - with UCS4 builds of Python). These builds then use a 32-bit type for - :c:type:`Py_UNICODE` and store Unicode data internally as UCS4. On platforms - where :c:type:`wchar_t` is available and compatible with the chosen Python - Unicode build variant, :c:type:`Py_UNICODE` is a typedef alias for - :c:type:`wchar_t` to enhance native platform compatibility. On all other - platforms, :c:type:`Py_UNICODE` is a typedef alias for either :c:type:`unsigned - short` (UCS2) or :c:type:`unsigned long` (UCS4). + This is a typedef of :c:type:`wchar_t`, which is a 16-bit type or 32-bit type + depending on the platform. -Note that UCS2 and UCS4 Python builds are not binary compatible. Please keep -this in mind when writing extensions or interfaces. + .. versionchanged:: 3.3 + In previous versions, this was a 16-bit type or a 32-bit type depending on + whether you selected a "narrow" or "wide" Unicode version of Python at + build time. -.. c:type:: PyUnicodeObject +.. c:type:: PyASCIIObject + PyCompactUnicodeObject + PyUnicodeObject - This subtype of :c:type:`PyObject` represents a Python Unicode object. + These subtypes of :c:type:`PyObject` represent a Python Unicode object. In + almost all cases, they shouldn't be used directly, since all API functions + that deal with Unicode objects take and return :c:type:`PyObject` pointers. + + .. versionadded:: 3.3 .. c:var:: PyTypeObject PyUnicode_Type @@ -45,10 +79,10 @@ this in mind when writing extensions or interfaces. This instance of :c:type:`PyTypeObject` represents the Python Unicode type. It is exposed to Python code as ``str``. 
+ The following APIs are really C macros and can be used to do fast checks and to access internal read-only data of Unicode objects: - .. c:function:: int PyUnicode_Check(PyObject *o) Return true if the object *o* is a Unicode object or an instance of a Unicode @@ -61,28 +95,106 @@ access internal read-only data of Unicode objects: subtype. -.. c:function:: Py_ssize_t PyUnicode_GET_SIZE(PyObject *o) +.. c:function:: int PyUnicode_READY(PyObject *o) - Return the size of the object. *o* has to be a :c:type:`PyUnicodeObject` (not - checked). + Ensure the string object *o* is in the "canonical" representation. This is + required before using any of the access macros described below. + .. XXX expand on when it is not required -.. c:function:: Py_ssize_t PyUnicode_GET_DATA_SIZE(PyObject *o) + Returns 0 on success and -1 with an exception set on failure, which in + particular happens if memory allocation fails. - Return the size of the object's internal buffer in bytes. *o* has to be a - :c:type:`PyUnicodeObject` (not checked). + .. versionadded:: 3.3 -.. c:function:: Py_UNICODE* PyUnicode_AS_UNICODE(PyObject *o) +.. c:function:: Py_ssize_t PyUnicode_GET_LENGTH(PyObject *o) + + Return the length of the Unicode string, in code points. *o* has to be a + Unicode object in the "canonical" representation (not checked). + + .. versionadded:: 3.3 + + +.. c:function:: Py_UCS1* PyUnicode_1BYTE_DATA(PyObject *o) + Py_UCS2* PyUnicode_2BYTE_DATA(PyObject *o) + Py_UCS4* PyUnicode_4BYTE_DATA(PyObject *o) + + Return a pointer to the canonical representation cast to UCS1, UCS2 or UCS4 + integer types for direct character access. No checks are performed if the + canonical representation has the correct character size; use + :c:func:`PyUnicode_KIND` to select the right macro. Make sure + :c:func:`PyUnicode_READY` has been called before accessing this. + + .. versionadded:: 3.3 + + +.. 
c:macro:: PyUnicode_WCHAR_KIND + PyUnicode_1BYTE_KIND + PyUnicode_2BYTE_KIND + PyUnicode_4BYTE_KIND + + Return values of the :c:func:`PyUnicode_KIND` macro. + + .. versionadded:: 3.3 + + +.. c:function:: int PyUnicode_KIND(PyObject *o) + + Return one of the PyUnicode kind constants (see above) that indicate how many + bytes per character this Unicode object uses to store its data. *o* has to + be a Unicode object in the "canonical" representation (not checked). + + .. XXX document "0" return value? + + .. versionadded:: 3.3 + + +.. c:function:: void* PyUnicode_DATA(PyObject *o) + + Return a void pointer to the raw unicode buffer. *o* has to be a Unicode + object in the "canonical" representation (not checked). + + .. versionadded:: 3.3 + + +.. c:function:: void PyUnicode_WRITE(int kind, void *data, Py_ssize_t index, \ + Py_UCS4 value) + + Write into a canonical representation *data* (as obtained with + :c:func:`PyUnicode_DATA`). This macro does not do any sanity checks and is + intended for usage in loops. The caller should cache the *kind* value and + *data* pointer as obtained from other macro calls. *index* is the index in + the string (starts at 0) and *value* is the new code point value which should + be written to that location. + + .. versionadded:: 3.3 + + +.. c:function:: Py_UCS4 PyUnicode_READ(int kind, void *data, Py_ssize_t index) + + Read a code point from a canonical representation *data* (as obtained with + :c:func:`PyUnicode_DATA`). No checks or ready calls are performed. + + .. versionadded:: 3.3 + + +.. c:function:: Py_UCS4 PyUnicode_READ_CHAR(PyObject *o, Py_ssize_t index) + + Read a character from a Unicode object *o*, which must be in the "canonical" + representation. This is less efficient than :c:func:`PyUnicode_READ` if you + do multiple consecutive reads. + + .. versionadded:: 3.3 - Return a pointer to the internal :c:type:`Py_UNICODE` buffer of the object. *o* - has to be a :c:type:`PyUnicodeObject` (not checked). +.. 
c:function:: PyUnicode_MAX_CHAR_VALUE(PyObject *o) -.. c:function:: const char* PyUnicode_AS_DATA(PyObject *o) + Return the maximum code point that is suitable for creating another string + based on *o*, which must be in the "canonical" representation. This is + always an approximation but more efficient than iterating over the string. - Return a pointer to the internal buffer of the object. *o* has to be a - :c:type:`PyUnicodeObject` (not checked). + .. versionadded:: 3.3 .. c:function:: int PyUnicode_ClearFreeList() @@ -90,6 +202,46 @@ access internal read-only data of Unicode objects: Clear the free list. Return the total number of freed items. +.. c:function:: Py_ssize_t PyUnicode_GET_SIZE(PyObject *o) + + Return the size of the deprecated :c:type:`Py_UNICODE` representation, in + code units (this includes surrogate pairs as 2 units). *o* has to be a + Unicode object (not checked). + + .. deprecated-removed:: 3.3 4.0 + Part of the old-style Unicode API, please migrate to using + :c:func:`PyUnicode_GET_LENGTH`. + + +.. c:function:: Py_ssize_t PyUnicode_GET_DATA_SIZE(PyObject *o) + + Return the size of the deprecated :c:type:`Py_UNICODE` representation in + bytes. *o* has to be a Unicode object (not checked). + + .. deprecated-removed:: 3.3 4.0 + Part of the old-style Unicode API, please migrate to using + :c:func:`PyUnicode_GET_LENGTH`. + + +.. c:function:: Py_UNICODE* PyUnicode_AS_UNICODE(PyObject *o) + const char* PyUnicode_AS_DATA(PyObject *o) + + Return a pointer to a :c:type:`Py_UNICODE` representation of the object. The + ``AS_DATA`` form casts the pointer to :c:type:`const char *`. *o* has to be + a Unicode object (not checked). + + .. versionchanged:: 3.3 + This macro is now inefficient -- because in many cases the + :c:type:`Py_UNICODE` representation does not exist and needs to be created + -- and can fail (return *NULL* with an exception set). 
Try to port the + code to use the new :c:func:`PyUnicode_nBYTE_DATA` macros or use + :c:func:`PyUnicode_WRITE` or :c:func:`PyUnicode_READ`. + + .. deprecated-removed:: 3.3 4.0 + Part of the old-style Unicode API, please migrate to using the + :c:func:`PyUnicode_nBYTE_DATA` family of macros. + + Unicode Character Properties """""""""""""""""""""""""""" @@ -166,16 +318,25 @@ These APIs can be used for fast direct character conversions: Return the character *ch* converted to lower case. + .. deprecated:: 3.3 + This function uses simple case mappings. + .. c:function:: Py_UNICODE Py_UNICODE_TOUPPER(Py_UNICODE ch) Return the character *ch* converted to upper case. + .. deprecated:: 3.3 + This function uses simple case mappings. + .. c:function:: Py_UNICODE Py_UNICODE_TOTITLE(Py_UNICODE ch) Return the character *ch* converted to title case. + .. deprecated:: 3.3 + This function uses simple case mappings. + .. c:function:: int Py_UNICODE_TODECIMAL(Py_UNICODE ch) @@ -195,31 +356,66 @@ These APIs can be used for fast direct character conversions: possible. This macro does not raise exceptions. -Plain Py_UNICODE -"""""""""""""""" +These APIs can be used to work with surrogates: + +.. c:macro:: Py_UNICODE_IS_SURROGATE(ch) + + Check if *ch* is a surrogate (``0xD800 <= ch <= 0xDFFF``). + +.. c:macro:: Py_UNICODE_IS_HIGH_SURROGATE(ch) + + Check if *ch* is an high surrogate (``0xD800 <= ch <= 0xDBFF``). + +.. c:macro:: Py_UNICODE_IS_LOW_SURROGATE(ch) + + Check if *ch* is a low surrogate (``0xDC00 <= ch <= 0xDFFF``). + +.. c:macro:: Py_UNICODE_JOIN_SURROGATES(high, low) + + Join two surrogate characters and return a single Py_UCS4 value. + *high* and *low* are respectively the leading and trailing surrogates in a + surrogate pair. + + +Creating and accessing Unicode strings +"""""""""""""""""""""""""""""""""""""" To create Unicode objects and access their basic sequence properties, use these APIs: +.. c:function:: PyObject* PyUnicode_New(Py_ssize_t size, Py_UCS4 maxchar) -.. 
c:function:: PyObject* PyUnicode_FromUnicode(const Py_UNICODE *u, Py_ssize_t size) + Create a new Unicode object. *maxchar* should be the true maximum code point + to be placed in the string. As an approximation, it can be rounded up to the + nearest value in the sequence 127, 255, 65535, 1114111. - Create a Unicode object from the Py_UNICODE buffer *u* of the given size. *u* - may be *NULL* which causes the contents to be undefined. It is the user's - responsibility to fill in the needed data. The buffer is copied into the new - object. If the buffer is not *NULL*, the return value might be a shared object. - Therefore, modification of the resulting Unicode object is only allowed when *u* - is *NULL*. + This is the recommended way to allocate a new Unicode object. Objects + created using this function are not resizable. + + .. versionadded:: 3.3 + + +.. c:function:: PyObject* PyUnicode_FromKindAndData(int kind, const void *buffer, \ + Py_ssize_t size) + + Create a new Unicode object with the given *kind* (possible values are + :c:macro:`PyUnicode_1BYTE_KIND` etc., as returned by + :c:func:`PyUnicode_KIND`). The *buffer* must point to an array of *size* + units of 1, 2 or 4 bytes per character, as given by the kind. + + .. versionadded:: 3.3 .. c:function:: PyObject* PyUnicode_FromStringAndSize(const char *u, Py_ssize_t size) - Create a Unicode object from the char buffer *u*. The bytes will be interpreted - as being UTF-8 encoded. *u* may also be *NULL* which - causes the contents to be undefined. It is the user's responsibility to fill in - the needed data. The buffer is copied into the new object. If the buffer is not - *NULL*, the return value might be a shared object. Therefore, modification of - the resulting Unicode object is only allowed when *u* is *NULL*. + Create a Unicode object from the char buffer *u*. The bytes will be + interpreted as being UTF-8 encoded. The buffer is copied into the new + object. 
If the buffer is not *NULL*, the return value might be a shared + object, i.e. modification of the data is not allowed. + + If *u* is *NULL*, this function behaves like :c:func:`PyUnicode_FromUnicode` + with the buffer set to *NULL*. This usage is deprecated in favor of + :c:func:`PyUnicode_New`. .. c:function:: PyObject *PyUnicode_FromString(const char *u) @@ -260,18 +456,27 @@ APIs: | :attr:`%ld` | long | Exactly equivalent to | | | | ``printf("%ld")``. | +-------------------+---------------------+--------------------------------+ + | :attr:`%li` | long | Exactly equivalent to | + | | | ``printf("%li")``. | + +-------------------+---------------------+--------------------------------+ | :attr:`%lu` | unsigned long | Exactly equivalent to | | | | ``printf("%lu")``. | +-------------------+---------------------+--------------------------------+ | :attr:`%lld` | long long | Exactly equivalent to | | | | ``printf("%lld")``. | +-------------------+---------------------+--------------------------------+ + | :attr:`%lli` | long long | Exactly equivalent to | + | | | ``printf("%lli")``. | + +-------------------+---------------------+--------------------------------+ | :attr:`%llu` | unsigned long long | Exactly equivalent to | | | | ``printf("%llu")``. | +-------------------+---------------------+--------------------------------+ | :attr:`%zd` | Py_ssize_t | Exactly equivalent to | | | | ``printf("%zd")``. | +-------------------+---------------------+--------------------------------+ + | :attr:`%zi` | Py_ssize_t | Exactly equivalent to | + | | | ``printf("%zi")``. | + +-------------------+---------------------+--------------------------------+ | :attr:`%zu` | size_t | Exactly equivalent to | | | | ``printf("%zu")``. | +-------------------+---------------------+--------------------------------+ @@ -322,27 +527,178 @@ APIs: .. versionchanged:: 3.2 Support for ``"%lld"`` and ``"%llu"`` added. + .. versionchanged:: 3.3 + Support for ``"%li"``, ``"%lli"`` and ``"%zi"`` added. 
+ .. c:function:: PyObject* PyUnicode_FromFormatV(const char *format, va_list vargs) Identical to :c:func:`PyUnicode_FromFormat` except that it takes exactly two arguments. + +.. c:function:: PyObject* PyUnicode_FromEncodedObject(PyObject *obj, \ + const char *encoding, const char *errors) + + Coerce an encoded object *obj* to an Unicode object and return a reference with + incremented refcount. + + :class:`bytes`, :class:`bytearray` and other char buffer compatible objects + are decoded according to the given *encoding* and using the error handling + defined by *errors*. Both can be *NULL* to have the interface use the default + values (see the next section for details). + + All other objects, including Unicode objects, cause a :exc:`TypeError` to be + set. + + The API returns *NULL* if there was an error. The caller is responsible for + decref'ing the returned objects. + + +.. c:function:: Py_ssize_t PyUnicode_GetLength(PyObject *unicode) + + Return the length of the Unicode object, in code points. + + .. versionadded:: 3.3 + + +.. c:function:: int PyUnicode_CopyCharacters(PyObject *to, Py_ssize_t to_start, \ + PyObject *to, Py_ssize_t from_start, Py_ssize_t how_many) + + Copy characters from one Unicode object into another. This function performs + character conversion when necessary and falls back to :c:func:`memcpy` if + possible. Returns ``-1`` and sets an exception on error, otherwise returns + ``0``. + + .. versionadded:: 3.3 + + +.. c:function:: Py_ssize_t PyUnicode_Fill(PyObject *unicode, Py_ssize_t start, \ + Py_ssize_t length, Py_UCS4 fill_char) + + Fill a string with a character: write *fill_char* into + ``unicode[start:start+length]``. + + Fail if *fill_char* is bigger than the string maximum character, or if the + string has more than 1 reference. + + Return the number of written character, or return ``-1`` and raise an + exception on error. + + .. versionadded:: 3.3 + + +.. 
c:function:: int PyUnicode_WriteChar(PyObject *unicode, Py_ssize_t index, \ + Py_UCS4 character) + + Write a character to a string. The string must have been created through + :c:func:`PyUnicode_New`. Since Unicode strings are supposed to be immutable, + the string must not be shared, or have been hashed yet. + + This function checks that *unicode* is a Unicode object, that the index is + not out of bounds, and that the object can be modified safely (i.e. that it + its reference count is one), in contrast to the macro version + :c:func:`PyUnicode_WRITE_CHAR`. + + .. versionadded:: 3.3 + + +.. c:function:: Py_UCS4 PyUnicode_ReadChar(PyObject *unicode, Py_ssize_t index) + + Read a character from a string. This function checks that *unicode* is a + Unicode object and the index is not out of bounds, in contrast to the macro + version :c:func:`PyUnicode_READ_CHAR`. + + .. versionadded:: 3.3 + + +.. c:function:: PyObject* PyUnicode_Substring(PyObject *str, Py_ssize_t start, \ + Py_ssize_t end) + + Return a substring of *str*, from character index *start* (included) to + character index *end* (excluded). Negative indices are not supported. + + .. versionadded:: 3.3 + + +.. c:function:: Py_UCS4* PyUnicode_AsUCS4(PyObject *u, Py_UCS4 *buffer, \ + Py_ssize_t buflen, int copy_null) + + Copy the string *u* into a UCS4 buffer, including a null character, if + *copy_null* is set. Returns *NULL* and sets an exception on error (in + particular, a :exc:`ValueError` if *buflen* is smaller than the length of + *u*). *buffer* is returned on success. + + .. versionadded:: 3.3 + + +.. c:function:: Py_UCS4* PyUnicode_AsUCS4Copy(PyObject *u) + + Copy the string *u* into a new UCS4 buffer that is allocated using + :c:func:`PyMem_Malloc`. If this fails, *NULL* is returned with a + :exc:`MemoryError` set. + + .. versionadded:: 3.3 + + +Deprecated Py_UNICODE APIs +"""""""""""""""""""""""""" + +.. 
deprecated-removed:: 3.3 4.0 + +These API functions are deprecated with the implementation of :pep:`393`. +Extension modules can continue using them, as they will not be removed in Python +3.x, but need to be aware that their use can now cause performance and memory hits. + + +.. c:function:: PyObject* PyUnicode_FromUnicode(const Py_UNICODE *u, Py_ssize_t size) + + Create a Unicode object from the Py_UNICODE buffer *u* of the given size. *u* + may be *NULL* which causes the contents to be undefined. It is the user's + responsibility to fill in the needed data. The buffer is copied into the new + object. + + If the buffer is not *NULL*, the return value might be a shared object. + Therefore, modification of the resulting Unicode object is only allowed when + *u* is *NULL*. + + If the buffer is *NULL*, :c:func:`PyUnicode_READY` must be called once the + string content has been filled before using any of the access macros such as + :c:func:`PyUnicode_KIND`. + + Please migrate to using :c:func:`PyUnicode_FromKindAndData` or + :c:func:`PyUnicode_New`. + + +.. c:function:: Py_UNICODE* PyUnicode_AsUnicode(PyObject *unicode) + + Return a read-only pointer to the Unicode object's internal + :c:type:`Py_UNICODE` buffer, or *NULL* on error. This will create the + :c:type:`Py_UNICODE*` representation of the object if it is not yet + available. Note that the resulting :c:type:`Py_UNICODE` string may contain + embedded null characters, which would cause the string to be truncated when + used in most C functions. + + Please migrate to using :c:func:`PyUnicode_AsUCS4`, + :c:func:`PyUnicode_Substring`, :c:func:`PyUnicode_ReadChar` or similar new + APIs. + + .. c:function:: PyObject* PyUnicode_TransformDecimalToASCII(Py_UNICODE *s, Py_ssize_t size) Create a Unicode object by replacing all decimal digits in :c:type:`Py_UNICODE` buffer of the given *size* by ASCII digits 0--9 - according to their decimal value. Return *NULL* if an exception - occurs. 
+ according to their decimal value. Return *NULL* if an exception occurs. -.. c:function:: Py_UNICODE* PyUnicode_AsUnicode(PyObject *unicode) +.. c:function:: Py_UNICODE* PyUnicode_AsUnicodeAndSize(PyObject *unicode, Py_ssize_t *size) - Return a read-only pointer to the Unicode object's internal - :c:type:`Py_UNICODE` buffer, *NULL* if *unicode* is not a Unicode object. - Note that the resulting :c:type:`Py_UNICODE*` string may contain embedded - null characters, which would cause the string to be truncated when used in - most C functions. + Like :c:func:`PyUnicode_AsUnicode`, but also saves the :c:func:`Py_UNICODE` + array length in *size*. Note that the resulting :c:type:`Py_UNICODE*` string + may contain embedded null characters, which would cause the string to be + truncated when used in most C functions. + + .. versionadded:: 3.3 .. c:function:: Py_UNICODE* PyUnicode_AsUnicodeCopy(PyObject *unicode) @@ -350,44 +706,77 @@ APIs: Create a copy of a Unicode string ending with a nul character. Return *NULL* and raise a :exc:`MemoryError` exception on memory allocation failure, otherwise return a new allocated buffer (use :c:func:`PyMem_Free` to free - the buffer). Note that the resulting :c:type:`Py_UNICODE*` string may contain - embedded null characters, which would cause the string to be truncated when - used in most C functions. + the buffer). Note that the resulting :c:type:`Py_UNICODE*` string may + contain embedded null characters, which would cause the string to be + truncated when used in most C functions. .. versionadded:: 3.2 + Please migrate to using :c:func:`PyUnicode_AsUCS4Copy` or similar new APIs. + .. c:function:: Py_ssize_t PyUnicode_GetSize(PyObject *unicode) - Return the length of the Unicode object. + Return the size of the deprecated :c:type:`Py_UNICODE` representation, in + code units (this includes surrogate pairs as 2 units). + Please migrate to using :c:func:`PyUnicode_GetLength`. -.. 
c:function:: PyObject* PyUnicode_FromEncodedObject(PyObject *obj, const char *encoding, const char *errors) - Coerce an encoded object *obj* to an Unicode object and return a reference with - incremented refcount. +.. c:function:: PyObject* PyUnicode_FromObject(PyObject *obj) - :class:`bytes`, :class:`bytearray` and other char buffer compatible objects - are decoded according to the given *encoding* and using the error handling - defined by *errors*. Both can be *NULL* to have the interface use the default - values (see the next section for details). + Shortcut for ``PyUnicode_FromEncodedObject(obj, NULL, "strict")`` which is used + throughout the interpreter whenever coercion to Unicode is needed. - All other objects, including Unicode objects, cause a :exc:`TypeError` to be - set. - The API returns *NULL* if there was an error. The caller is responsible for - decref'ing the returned objects. +Locale Encoding +""""""""""""""" +The current locale encoding can be used to decode text from the operating +system. -.. c:function:: PyObject* PyUnicode_FromObject(PyObject *obj) +.. c:function:: PyObject* PyUnicode_DecodeLocaleAndSize(const char *str, \ + Py_ssize_t len, \ + const char *errors) + + Decode a string from the current locale encoding. The supported + error handlers are ``"strict"`` and ``"surrogateescape"`` + (:pep:`383`). The decoder uses ``"strict"`` error handler if + *errors* is ``NULL``. *str* must end with a null character but + cannot contain embedded null characters. + + .. seealso:: + + Use :c:func:`PyUnicode_DecodeFSDefaultAndSize` to decode a string from + :c:data:`Py_FileSystemDefaultEncoding` (the locale encoding read at + Python startup). + + .. versionadded:: 3.3 + + +.. c:function:: PyObject* PyUnicode_DecodeLocale(const char *str, const char *errors) + + Similar to :c:func:`PyUnicode_DecodeLocaleAndSize`, but compute the string + length using :c:func:`strlen`. + + .. 
versionadded:: 3.3 - Shortcut for ``PyUnicode_FromEncodedObject(obj, NULL, "strict")`` which is used - throughout the interpreter whenever coercion to Unicode is needed. -If the platform supports :c:type:`wchar_t` and provides a header file wchar.h, -Python can interface directly to this type using the following functions. -Support is optimized if Python's own :c:type:`Py_UNICODE` type is identical to -the system's :c:type:`wchar_t`. +.. c:function:: PyObject* PyUnicode_EncodeLocale(PyObject *unicode, const char *errors) + + Encode a Unicode object to the current locale encoding. The + supported error handlers are ``"strict"`` and ``"surrogateescape"`` + (:pep:`383`). The encoder uses ``"strict"`` error handler if + *errors* is ``NULL``. Return a :class:`bytes` object. *str* cannot + contain embedded null characters. + + .. seealso:: + + Use :c:func:`PyUnicode_EncodeFSDefault` to encode a string to + :c:data:`Py_FileSystemDefaultEncoding` (the locale encoding read at + Python startup). + + .. versionadded:: 3.3 File System Encoding @@ -425,19 +814,26 @@ used, passing :c:func:`PyUnicode_FSDecoder` as the conversion function: .. c:function:: PyObject* PyUnicode_DecodeFSDefaultAndSize(const char *s, Py_ssize_t size) Decode a string using :c:data:`Py_FileSystemDefaultEncoding` and the - ``'surrogateescape'`` error handler, or ``'strict'`` on Windows. + ``"surrogateescape"`` error handler, or ``"strict"`` on Windows. If :c:data:`Py_FileSystemDefaultEncoding` is not set, fall back to the locale encoding. + .. seealso:: + + :c:data:`Py_FileSystemDefaultEncoding` is initialized at startup from the + locale encoding and cannot be modified later. If you need to decode a + string from the current locale encoding, use + :c:func:`PyUnicode_DecodeLocaleAndSize`. + .. versionchanged:: 3.2 - Use ``'strict'`` error handler on Windows. + Use ``"strict"`` error handler on Windows. .. 
c:function:: PyObject* PyUnicode_DecodeFSDefault(const char *s) Decode a null-terminated string using :c:data:`Py_FileSystemDefaultEncoding` - and the ``'surrogateescape'`` error handler, or ``'strict'`` on Windows. + and the ``"surrogateescape"`` error handler, or ``"strict"`` on Windows. If :c:data:`Py_FileSystemDefaultEncoding` is not set, fall back to the locale encoding. @@ -445,19 +841,26 @@ used, passing :c:func:`PyUnicode_FSDecoder` as the conversion function: Use :c:func:`PyUnicode_DecodeFSDefaultAndSize` if you know the string length. .. versionchanged:: 3.2 - Use ``'strict'`` error handler on Windows. + Use ``"strict"`` error handler on Windows. .. c:function:: PyObject* PyUnicode_EncodeFSDefault(PyObject *unicode) Encode a Unicode object to :c:data:`Py_FileSystemDefaultEncoding` with the - ``'surrogateescape'`` error handler, or ``'strict'`` on Windows, and return + ``"surrogateescape"`` error handler, or ``"strict"`` on Windows, and return :class:`bytes`. Note that the resulting :class:`bytes` object may contain null bytes. If :c:data:`Py_FileSystemDefaultEncoding` is not set, fall back to the locale encoding. + .. seealso:: + + :c:data:`Py_FileSystemDefaultEncoding` is initialized at startup from the + locale encoding and cannot be modified later. If you need to encode a + string to the current locale encoding, use + :c:func:`PyUnicode_EncodeLocale`. + .. versionadded:: 3.2 @@ -479,9 +882,9 @@ wchar_t Support Copy the Unicode object contents into the :c:type:`wchar_t` buffer *w*. At most *size* :c:type:`wchar_t` characters are copied (excluding a possibly trailing 0-termination character). Return the number of :c:type:`wchar_t` characters - copied or -1 in case of an error. Note that the resulting :c:type:`wchar_t` + copied or -1 in case of an error. Note that the resulting :c:type:`wchar_t*` string may or may not be 0-terminated. 
It is the responsibility of the caller - to make sure that the :c:type:`wchar_t` string is 0-terminated in case this is + to make sure that the :c:type:`wchar_t*` string is 0-terminated in case this is required by the application. Also, note that the :c:type:`wchar_t*` string might contain null characters, which would cause the string to be truncated when used with most C functions. @@ -497,12 +900,32 @@ wchar_t Support Returns a buffer allocated by :c:func:`PyMem_Alloc` (use :c:func:`PyMem_Free` to free it) on success. On error, returns *NULL*, *\*size* is undefined and raises a :exc:`MemoryError`. Note that the - resulting :c:type:`wchar_t*` string might contain null characters, which + resulting :c:type:`wchar_t` string might contain null characters, which would cause the string to be truncated when used with most C functions. .. versionadded:: 3.2 +UCS4 Support +"""""""""""" + +.. versionadded:: 3.3 + +.. XXX are these meant to be public? + +.. c:function:: size_t Py_UCS4_strlen(const Py_UCS4 *u) + Py_UCS4* Py_UCS4_strcpy(Py_UCS4 *s1, const Py_UCS4 *s2) + Py_UCS4* Py_UCS4_strncpy(Py_UCS4 *s1, const Py_UCS4 *s2, size_t n) + Py_UCS4* Py_UCS4_strcat(Py_UCS4 *s1, const Py_UCS4 *s2) + int Py_UCS4_strcmp(const Py_UCS4 *s1, const Py_UCS4 *s2) + int Py_UCS4_strncmp(const Py_UCS4 *s1, const Py_UCS4 *s2, size_t n) + Py_UCS4* Py_UCS4_strchr(const Py_UCS4 *s, Py_UCS4 c) + Py_UCS4* Py_UCS4_strrchr(const Py_UCS4 *s, Py_UCS4 c) + + These utility functions work on strings of :c:type:`Py_UCS4` characters and + otherwise behave like the C standard library functions with the same name. + + .. _builtincodecs: Built-in Codecs @@ -537,7 +960,8 @@ Generic Codecs These are the generic codec APIs: -.. c:function:: PyObject* PyUnicode_Decode(const char *s, Py_ssize_t size, const char *encoding, const char *errors) +.. 
c:function:: PyObject* PyUnicode_Decode(const char *s, Py_ssize_t size, \ + const char *encoding, const char *errors) Create a Unicode object by decoding *size* bytes of the encoded string *s*. *encoding* and *errors* have the same meaning as the parameters of the same name @@ -546,7 +970,18 @@ These are the generic codec APIs: the codec. -.. c:function:: PyObject* PyUnicode_Encode(const Py_UNICODE *s, Py_ssize_t size, const char *encoding, const char *errors) +.. c:function:: PyObject* PyUnicode_AsEncodedString(PyObject *unicode, \ + const char *encoding, const char *errors) + + Encode a Unicode object and return the result as Python bytes object. + *encoding* and *errors* have the same meaning as the parameters of the same + name in the Unicode :meth:`encode` method. The codec to be used is looked up + using the Python codec registry. Return *NULL* if an exception was raised by + the codec. + + +.. c:function:: PyObject* PyUnicode_Encode(const Py_UNICODE *s, Py_ssize_t size, \ + const char *encoding, const char *errors) Encode the :c:type:`Py_UNICODE` buffer *s* of the given *size* and return a Python bytes object. *encoding* and *errors* have the same meaning as the @@ -554,14 +989,9 @@ These are the generic codec APIs: to be used is looked up using the Python codec registry. Return *NULL* if an exception was raised by the codec. - -.. c:function:: PyObject* PyUnicode_AsEncodedString(PyObject *unicode, const char *encoding, const char *errors) - - Encode a Unicode object and return the result as Python bytes object. - *encoding* and *errors* have the same meaning as the parameters of the same - name in the Unicode :meth:`encode` method. The codec to be used is looked up - using the Python codec registry. Return *NULL* if an exception was raised by - the codec. + .. deprecated-removed:: 3.3 4.0 + Part of the old-style :c:type:`Py_UNICODE` API; please migrate to using + :c:func:`PyUnicode_AsEncodedString`. 
UTF-8 Codecs @@ -576,7 +1006,8 @@ These are the UTF-8 codec APIs: *s*. Return *NULL* if an exception was raised by the codec. -.. c:function:: PyObject* PyUnicode_DecodeUTF8Stateful(const char *s, Py_ssize_t size, const char *errors, Py_ssize_t *consumed) +.. c:function:: PyObject* PyUnicode_DecodeUTF8Stateful(const char *s, Py_ssize_t size, \ + const char *errors, Py_ssize_t *consumed) If *consumed* is *NULL*, behave like :c:func:`PyUnicode_DecodeUTF8`. If *consumed* is not *NULL*, trailing incomplete UTF-8 byte sequences will not be @@ -584,18 +1015,45 @@ These are the UTF-8 codec APIs: that have been decoded will be stored in *consumed*. +.. c:function:: PyObject* PyUnicode_AsUTF8String(PyObject *unicode) + + Encode a Unicode object using UTF-8 and return the result as Python bytes + object. Error handling is "strict". Return *NULL* if an exception was + raised by the codec. + + +.. c:function:: char* PyUnicode_AsUTF8AndSize(PyObject *unicode, Py_ssize_t *size) + + Return a pointer to the default encoding (UTF-8) of the Unicode object, and + store the size of the encoded representation (in bytes) in *size*. *size* + can be *NULL*, in this case no size will be stored. + + In the case of an error, *NULL* is returned with an exception set and no + *size* is stored. + + This caches the UTF-8 representation of the string in the Unicode object, and + subsequent calls will return a pointer to the same buffer. The caller is not + responsible for deallocating the buffer. + + .. versionadded:: 3.3 + + +.. c:function:: char* PyUnicode_AsUTF8(PyObject *unicode) + + As :c:func:`PyUnicode_AsUTF8AndSize`, but does not store the size. + + .. versionadded:: 3.3 + + .. c:function:: PyObject* PyUnicode_EncodeUTF8(const Py_UNICODE *s, Py_ssize_t size, const char *errors) Encode the :c:type:`Py_UNICODE` buffer *s* of the given *size* using UTF-8 and return a Python bytes object. Return *NULL* if an exception was raised by the codec. - -.. 
c:function:: PyObject* PyUnicode_AsUTF8String(PyObject *unicode) - - Encode a Unicode object using UTF-8 and return the result as Python bytes - object. Error handling is "strict". Return *NULL* if an exception was - raised by the codec. + .. deprecated-removed:: 3.3 4.0 + Part of the old-style :c:type:`Py_UNICODE` API; please migrate to using + :c:func:`PyUnicode_AsUTF8String` or :c:func:`PyUnicode_AsUTF8AndSize`. UTF-32 Codecs @@ -604,7 +1062,8 @@ UTF-32 Codecs These are the UTF-32 codec APIs: -.. c:function:: PyObject* PyUnicode_DecodeUTF32(const char *s, Py_ssize_t size, const char *errors, int *byteorder) +.. c:function:: PyObject* PyUnicode_DecodeUTF32(const char *s, Py_ssize_t size, \ + const char *errors, int *byteorder) Decode *size* bytes from a UTF-32 encoded buffer string and return the corresponding Unicode object. *errors* (if non-*NULL*) defines the error @@ -625,14 +1084,13 @@ These are the UTF-32 codec APIs: After completion, *\*byteorder* is set to the current byte order at the end of input data. - In a narrow build codepoints outside the BMP will be decoded as surrogate pairs. - If *byteorder* is *NULL*, the codec starts in native order mode. Return *NULL* if an exception was raised by the codec. -.. c:function:: PyObject* PyUnicode_DecodeUTF32Stateful(const char *s, Py_ssize_t size, const char *errors, int *byteorder, Py_ssize_t *consumed) +.. c:function:: PyObject* PyUnicode_DecodeUTF32Stateful(const char *s, Py_ssize_t size, \ + const char *errors, int *byteorder, Py_ssize_t *consumed) If *consumed* is *NULL*, behave like :c:func:`PyUnicode_DecodeUTF32`. If *consumed* is not *NULL*, :c:func:`PyUnicode_DecodeUTF32Stateful` will not treat @@ -641,7 +1099,15 @@ These are the UTF-32 codec APIs: that have been decoded will be stored in *consumed*. -.. c:function:: PyObject* PyUnicode_EncodeUTF32(const Py_UNICODE *s, Py_ssize_t size, const char *errors, int byteorder) +.. 
c:function:: PyObject* PyUnicode_AsUTF32String(PyObject *unicode) + + Return a Python byte string using the UTF-32 encoding in native byte + order. The string always starts with a BOM mark. Error handling is "strict". + Return *NULL* if an exception was raised by the codec. + + +.. c:function:: PyObject* PyUnicode_EncodeUTF32(const Py_UNICODE *s, Py_ssize_t size, \ + const char *errors, int byteorder) Return a Python bytes object holding the UTF-32 encoded value of the Unicode data in *s*. Output is written according to the following byte order:: @@ -658,12 +1124,9 @@ These are the UTF-32 codec APIs: Return *NULL* if an exception was raised by the codec. - -.. c:function:: PyObject* PyUnicode_AsUTF32String(PyObject *unicode) - - Return a Python byte string using the UTF-32 encoding in native byte - order. The string always starts with a BOM mark. Error handling is "strict". - Return *NULL* if an exception was raised by the codec. + .. deprecated-removed:: 3.3 4.0 + Part of the old-style :c:type:`Py_UNICODE` API; please migrate to using + :c:func:`PyUnicode_AsUTF32String`. UTF-16 Codecs @@ -672,7 +1135,8 @@ UTF-16 Codecs These are the UTF-16 codec APIs: -.. c:function:: PyObject* PyUnicode_DecodeUTF16(const char *s, Py_ssize_t size, const char *errors, int *byteorder) +.. c:function:: PyObject* PyUnicode_DecodeUTF16(const char *s, Py_ssize_t size, \ + const char *errors, int *byteorder) Decode *size* bytes from a UTF-16 encoded buffer string and return the corresponding Unicode object. *errors* (if non-*NULL*) defines the error @@ -699,7 +1163,8 @@ These are the UTF-16 codec APIs: Return *NULL* if an exception was raised by the codec. -.. c:function:: PyObject* PyUnicode_DecodeUTF16Stateful(const char *s, Py_ssize_t size, const char *errors, int *byteorder, Py_ssize_t *consumed) +.. 
c:function:: PyObject* PyUnicode_DecodeUTF16Stateful(const char *s, Py_ssize_t size, \ + const char *errors, int *byteorder, Py_ssize_t *consumed) If *consumed* is *NULL*, behave like :c:func:`PyUnicode_DecodeUTF16`. If *consumed* is not *NULL*, :c:func:`PyUnicode_DecodeUTF16Stateful` will not treat @@ -708,7 +1173,15 @@ These are the UTF-16 codec APIs: number of bytes that have been decoded will be stored in *consumed*. -.. c:function:: PyObject* PyUnicode_EncodeUTF16(const Py_UNICODE *s, Py_ssize_t size, const char *errors, int byteorder) +.. c:function:: PyObject* PyUnicode_AsUTF16String(PyObject *unicode) + + Return a Python byte string using the UTF-16 encoding in native byte + order. The string always starts with a BOM mark. Error handling is "strict". + Return *NULL* if an exception was raised by the codec. + + +.. c:function:: PyObject* PyUnicode_EncodeUTF16(const Py_UNICODE *s, Py_ssize_t size, \ + const char *errors, int byteorder) Return a Python bytes object holding the UTF-16 encoded value of the Unicode data in *s*. Output is written according to the following byte order:: @@ -726,12 +1199,9 @@ These are the UTF-16 codec APIs: Return *NULL* if an exception was raised by the codec. - -.. c:function:: PyObject* PyUnicode_AsUTF16String(PyObject *unicode) - - Return a Python byte string using the UTF-16 encoding in native byte - order. The string always starts with a BOM mark. Error handling is "strict". - Return *NULL* if an exception was raised by the codec. + .. deprecated-removed:: 3.3 4.0 + Part of the old-style :c:type:`Py_UNICODE` API; please migrate to using + :c:func:`PyUnicode_AsUTF16String`. UTF-7 Codecs @@ -746,7 +1216,8 @@ These are the UTF-7 codec APIs: *s*. Return *NULL* if an exception was raised by the codec. -.. c:function:: PyObject* PyUnicode_DecodeUTF7Stateful(const char *s, Py_ssize_t size, const char *errors, Py_ssize_t *consumed) +.. 
c:function:: PyObject* PyUnicode_DecodeUTF7Stateful(const char *s, Py_ssize_t size, \ + const char *errors, Py_ssize_t *consumed) If *consumed* is *NULL*, behave like :c:func:`PyUnicode_DecodeUTF7`. If *consumed* is not *NULL*, trailing incomplete UTF-7 base-64 sections will not @@ -754,7 +1225,8 @@ These are the UTF-7 codec APIs: bytes that have been decoded will be stored in *consumed*. -.. c:function:: PyObject* PyUnicode_EncodeUTF7(const Py_UNICODE *s, Py_ssize_t size, int base64SetO, int base64WhiteSpace, const char *errors) +.. c:function:: PyObject* PyUnicode_EncodeUTF7(const Py_UNICODE *s, Py_ssize_t size, \ + int base64SetO, int base64WhiteSpace, const char *errors) Encode the :c:type:`Py_UNICODE` buffer of the given size using UTF-7 and return a Python bytes object. Return *NULL* if an exception was raised by @@ -765,6 +1237,11 @@ These are the UTF-7 codec APIs: nonzero, whitespace will be encoded in base-64. Both are set to zero for the Python "utf-7" codec. + .. deprecated-removed:: 3.3 4.0 + Part of the old-style :c:type:`Py_UNICODE` API. + + .. XXX replace with what? + Unicode-Escape Codecs """"""""""""""""""""" @@ -772,24 +1249,29 @@ Unicode-Escape Codecs These are the "Unicode Escape" codec APIs: -.. c:function:: PyObject* PyUnicode_DecodeUnicodeEscape(const char *s, Py_ssize_t size, const char *errors) +.. c:function:: PyObject* PyUnicode_DecodeUnicodeEscape(const char *s, \ + Py_ssize_t size, const char *errors) Create a Unicode object by decoding *size* bytes of the Unicode-Escape encoded string *s*. Return *NULL* if an exception was raised by the codec. +.. c:function:: PyObject* PyUnicode_AsUnicodeEscapeString(PyObject *unicode) + + Encode a Unicode object using Unicode-Escape and return the result as Python + string object. Error handling is "strict". Return *NULL* if an exception was + raised by the codec. + + .. 
c:function:: PyObject* PyUnicode_EncodeUnicodeEscape(const Py_UNICODE *s, Py_ssize_t size) Encode the :c:type:`Py_UNICODE` buffer of the given *size* using Unicode-Escape and return a Python string object. Return *NULL* if an exception was raised by the codec. - -.. c:function:: PyObject* PyUnicode_AsUnicodeEscapeString(PyObject *unicode) - - Encode a Unicode object using Unicode-Escape and return the result as Python - string object. Error handling is "strict". Return *NULL* if an exception was - raised by the codec. + .. deprecated-removed:: 3.3 4.0 + Part of the old-style :c:type:`Py_UNICODE` API; please migrate to using + :c:func:`PyUnicode_AsUnicodeEscapeString`. Raw-Unicode-Escape Codecs @@ -798,19 +1280,13 @@ Raw-Unicode-Escape Codecs These are the "Raw Unicode Escape" codec APIs: -.. c:function:: PyObject* PyUnicode_DecodeRawUnicodeEscape(const char *s, Py_ssize_t size, const char *errors) +.. c:function:: PyObject* PyUnicode_DecodeRawUnicodeEscape(const char *s, \ + Py_ssize_t size, const char *errors) Create a Unicode object by decoding *size* bytes of the Raw-Unicode-Escape encoded string *s*. Return *NULL* if an exception was raised by the codec. -.. c:function:: PyObject* PyUnicode_EncodeRawUnicodeEscape(const Py_UNICODE *s, Py_ssize_t size, const char *errors) - - Encode the :c:type:`Py_UNICODE` buffer of the given *size* using Raw-Unicode-Escape - and return a Python string object. Return *NULL* if an exception was raised by - the codec. - - .. c:function:: PyObject* PyUnicode_AsRawUnicodeEscapeString(PyObject *unicode) Encode a Unicode object using Raw-Unicode-Escape and return the result as @@ -818,6 +1294,18 @@ These are the "Raw Unicode Escape" codec APIs: was raised by the codec. +.. c:function:: PyObject* PyUnicode_EncodeRawUnicodeEscape(const Py_UNICODE *s, \ + Py_ssize_t size, const char *errors) + + Encode the :c:type:`Py_UNICODE` buffer of the given *size* using Raw-Unicode-Escape + and return a Python string object. 
Return *NULL* if an exception was raised by + the codec. + + .. deprecated-removed:: 3.3 4.0 + Part of the old-style :c:type:`Py_UNICODE` API; please migrate to using + :c:func:`PyUnicode_AsRawUnicodeEscapeString`. + + Latin-1 Codecs """""""""""""" @@ -831,18 +1319,22 @@ ordinals and only these are accepted by the codecs during encoding. *s*. Return *NULL* if an exception was raised by the codec. +.. c:function:: PyObject* PyUnicode_AsLatin1String(PyObject *unicode) + + Encode a Unicode object using Latin-1 and return the result as Python bytes + object. Error handling is "strict". Return *NULL* if an exception was + raised by the codec. + + .. c:function:: PyObject* PyUnicode_EncodeLatin1(const Py_UNICODE *s, Py_ssize_t size, const char *errors) Encode the :c:type:`Py_UNICODE` buffer of the given *size* using Latin-1 and return a Python bytes object. Return *NULL* if an exception was raised by the codec. - -.. c:function:: PyObject* PyUnicode_AsLatin1String(PyObject *unicode) - - Encode a Unicode object using Latin-1 and return the result as Python bytes - object. Error handling is "strict". Return *NULL* if an exception was - raised by the codec. + .. deprecated-removed:: 3.3 4.0 + Part of the old-style :c:type:`Py_UNICODE` API; please migrate to using + :c:func:`PyUnicode_AsLatin1String`. ASCII Codecs @@ -858,18 +1350,22 @@ codes generate errors. *s*. Return *NULL* if an exception was raised by the codec. +.. c:function:: PyObject* PyUnicode_AsASCIIString(PyObject *unicode) + + Encode a Unicode object using ASCII and return the result as Python bytes + object. Error handling is "strict". Return *NULL* if an exception was + raised by the codec. + + .. c:function:: PyObject* PyUnicode_EncodeASCII(const Py_UNICODE *s, Py_ssize_t size, const char *errors) Encode the :c:type:`Py_UNICODE` buffer of the given *size* using ASCII and return a Python bytes object. Return *NULL* if an exception was raised by the codec. - -.. 
c:function:: PyObject* PyUnicode_AsASCIIString(PyObject *unicode) - - Encode a Unicode object using ASCII and return the result as Python bytes - object. Error handling is "strict". Return *NULL* if an exception was - raised by the codec. + .. deprecated-removed:: 3.3 4.0 + Part of the old-style :c:type:`Py_UNICODE` API; please migrate to using + :c:func:`PyUnicode_AsASCIIString`. Character Map Codecs @@ -898,7 +1394,8 @@ characters to different code points. These are the mapping codec APIs: -.. c:function:: PyObject* PyUnicode_DecodeCharmap(const char *s, Py_ssize_t size, PyObject *mapping, const char *errors) +.. c:function:: PyObject* PyUnicode_DecodeCharmap(const char *s, Py_ssize_t size, \ + PyObject *mapping, const char *errors) Create a Unicode object by decoding *size* bytes of the encoded string *s* using the given *mapping* object. Return *NULL* if an exception was raised by the @@ -908,13 +1405,6 @@ These are the mapping codec APIs: treated as "undefined mapping". -.. c:function:: PyObject* PyUnicode_EncodeCharmap(const Py_UNICODE *s, Py_ssize_t size, PyObject *mapping, const char *errors) - - Encode the :c:type:`Py_UNICODE` buffer of the given *size* using the given - *mapping* object and return a Python string object. Return *NULL* if an - exception was raised by the codec. - - .. c:function:: PyObject* PyUnicode_AsCharmapString(PyObject *unicode, PyObject *mapping) Encode a Unicode object using the given *mapping* object and return the result @@ -924,7 +1414,8 @@ These are the mapping codec APIs: The following codec API is special in that maps Unicode to Unicode. -.. c:function:: PyObject* PyUnicode_TranslateCharmap(const Py_UNICODE *s, Py_ssize_t size, PyObject *table, const char *errors) +.. 
c:function:: PyObject* PyUnicode_TranslateCharmap(const Py_UNICODE *s, Py_ssize_t size, \ + PyObject *table, const char *errors) Translate a :c:type:`Py_UNICODE` buffer of the given *size* by applying a character mapping *table* to it and return the resulting Unicode object. Return @@ -937,6 +1428,22 @@ The following codec API is special in that maps Unicode to Unicode. and sequences work well. Unmapped character ordinals (ones which cause a :exc:`LookupError`) are left untouched and are copied as-is. + .. deprecated-removed:: 3.3 4.0 + Part of the old-style :c:type:`Py_UNICODE` API. + + .. XXX replace with what? + + +.. c:function:: PyObject* PyUnicode_EncodeCharmap(const Py_UNICODE *s, Py_ssize_t size, \ + PyObject *mapping, const char *errors) + + Encode the :c:type:`Py_UNICODE` buffer of the given *size* using the given + *mapping* object and return a Python string object. Return *NULL* if an + exception was raised by the codec. + + .. deprecated-removed:: 3.3 4.0 + Part of the old-style :c:type:`Py_UNICODE` API; please migrate to using + :c:func:`PyUnicode_AsCharmapString`. MBCS codecs for Windows @@ -953,7 +1460,8 @@ the user settings on the machine running the codec. Return *NULL* if an exception was raised by the codec. -.. c:function:: PyObject* PyUnicode_DecodeMBCSStateful(const char *s, int size, const char *errors, int *consumed) +.. c:function:: PyObject* PyUnicode_DecodeMBCSStateful(const char *s, int size, \ + const char *errors, int *consumed) If *consumed* is *NULL*, behave like :c:func:`PyUnicode_DecodeMBCS`. If *consumed* is not *NULL*, :c:func:`PyUnicode_DecodeMBCSStateful` will not decode @@ -961,18 +1469,31 @@ the user settings on the machine running the codec. in *consumed*. +.. c:function:: PyObject* PyUnicode_AsMBCSString(PyObject *unicode) + + Encode a Unicode object using MBCS and return the result as Python bytes + object. Error handling is "strict". Return *NULL* if an exception was + raised by the codec. + + +.. 
c:function:: PyObject* PyUnicode_EncodeCodePage(int code_page, PyObject *unicode, const char *errors) + + Encode the Unicode object using the specified code page and return a Python + bytes object. Return *NULL* if an exception was raised by the codec. Use + :c:data:`CP_ACP` code page to get the MBCS encoder. + + .. versionadded:: 3.3 + + .. c:function:: PyObject* PyUnicode_EncodeMBCS(const Py_UNICODE *s, Py_ssize_t size, const char *errors) Encode the :c:type:`Py_UNICODE` buffer of the given *size* using MBCS and return a Python bytes object. Return *NULL* if an exception was raised by the codec. - -.. c:function:: PyObject* PyUnicode_AsMBCSString(PyObject *unicode) - - Encode a Unicode object using MBCS and return the result as Python bytes - object. Error handling is "strict". Return *NULL* if an exception was - raised by the codec. + .. deprecated-removed:: 3.3 4.0 + Part of the old-style :c:type:`Py_UNICODE` API; please migrate to using + :c:func:`PyUnicode_AsMBCSString` or :c:func:`PyUnicode_EncodeCodePage`. Methods & Slots @@ -1011,7 +1532,8 @@ They all return *NULL* or ``-1`` if an exception occurs. characters are not included in the resulting strings. -.. c:function:: PyObject* PyUnicode_Translate(PyObject *str, PyObject *table, const char *errors) +.. c:function:: PyObject* PyUnicode_Translate(PyObject *str, PyObject *table, \ + const char *errors) Translate a string by applying a character mapping table to it and return the resulting Unicode object. @@ -1033,14 +1555,16 @@ They all return *NULL* or ``-1`` if an exception occurs. Unicode string. -.. c:function:: int PyUnicode_Tailmatch(PyObject *str, PyObject *substr, Py_ssize_t start, Py_ssize_t end, int direction) +.. c:function:: int PyUnicode_Tailmatch(PyObject *str, PyObject *substr, \ + Py_ssize_t start, Py_ssize_t end, int direction) Return 1 if *substr* matches ``str[start:end]`` at the given tail end (*direction* == -1 means to do a prefix match, *direction* == 1 a suffix match), 0 otherwise. 
Return ``-1`` if an error occurred. -.. c:function:: Py_ssize_t PyUnicode_Find(PyObject *str, PyObject *substr, Py_ssize_t start, Py_ssize_t end, int direction) +.. c:function:: Py_ssize_t PyUnicode_Find(PyObject *str, PyObject *substr, \ + Py_ssize_t start, Py_ssize_t end, int direction) Return the first position of *substr* in ``str[start:end]`` using the given *direction* (*direction* == 1 means to do a forward search, *direction* == -1 a @@ -1049,13 +1573,27 @@ They all return *NULL* or ``-1`` if an exception occurs. occurred and an exception has been set. -.. c:function:: Py_ssize_t PyUnicode_Count(PyObject *str, PyObject *substr, Py_ssize_t start, Py_ssize_t end) +.. c:function:: Py_ssize_t PyUnicode_FindChar(PyObject *str, Py_UCS4 ch, \ + Py_ssize_t start, Py_ssize_t end, int direction) + + Return the first position of the character *ch* in ``str[start:end]`` using + the given *direction* (*direction* == 1 means to do a forward search, + *direction* == -1 a backward search). The return value is the index of the + first match; a value of ``-1`` indicates that no match was found, and ``-2`` + indicates that an error occurred and an exception has been set. + + .. versionadded:: 3.3 + + +.. c:function:: Py_ssize_t PyUnicode_Count(PyObject *str, PyObject *substr, \ + Py_ssize_t start, Py_ssize_t end) Return the number of non-overlapping occurrences of *substr* in ``str[start:end]``. Return ``-1`` if an error occurred. -.. c:function:: PyObject* PyUnicode_Replace(PyObject *str, PyObject *substr, PyObject *replstr, Py_ssize_t maxcount) +.. c:function:: PyObject* PyUnicode_Replace(PyObject *str, PyObject *substr, \ + PyObject *replstr, Py_ssize_t maxcount) Replace at most *maxcount* occurrences of *substr* in *str* with *replstr* and return the resulting Unicode object. *maxcount* == -1 means replace all @@ -1076,7 +1614,7 @@ They all return *NULL* or ``-1`` if an exception occurs. ISO-8859-1 if it contains non-ASCII characters". -.. 
c:function:: int PyUnicode_RichCompare(PyObject *left, PyObject *right, int op) +.. c:function:: PyObject* PyUnicode_RichCompare(PyObject *left, PyObject *right, int op) Rich compare two unicode strings and return one of the following: @@ -1103,8 +1641,8 @@ They all return *NULL* or ``-1`` if an exception occurs. Check whether *element* is contained in *container* and return true or false accordingly. - *element* has to coerce to a one element Unicode string. ``-1`` is returned if - there was an error. + *element* has to coerce to a one element Unicode string. ``-1`` is returned + if there was an error. .. c:function:: void PyUnicode_InternInPlace(PyObject **string) @@ -1123,7 +1661,6 @@ They all return *NULL* or ``-1`` if an exception occurs. .. c:function:: PyObject* PyUnicode_InternFromString(const char *v) A combination of :c:func:`PyUnicode_FromString` and - :c:func:`PyUnicode_InternInPlace`, returning either a new unicode string object - that has been interned, or a new ("owned") reference to an earlier interned - string object with the same value. - + :c:func:`PyUnicode_InternInPlace`, returning either a new unicode string + object that has been interned, or a new ("owned") reference to an earlier + interned string object with the same value.
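The "UCS4 Support" hunk in this patch describes the ``Py_UCS4_*`` helpers only as behaving "like the C standard library functions with the same name". The standalone sketch below illustrates what that could mean for a 0-terminated array of 32-bit code points. It is an illustration under stated assumptions, not CPython's implementation: ``ucs4_t`` is a hypothetical stand-in for ``Py_UCS4`` (assumed here to be a 32-bit unsigned type), and ``ucs4_strlen_sketch`` and ``ucs4_strchr_sketch`` are made-up names that do not exist in the C API.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for Py_UCS4 (assumption: a 32-bit unsigned type). */
typedef uint32_t ucs4_t;

/* Count the code points before the terminating 0, the way strlen()
 * counts bytes before the terminating '\0'. */
static size_t ucs4_strlen_sketch(const ucs4_t *u)
{
    size_t n = 0;
    while (u[n] != 0)
        n++;
    return n;
}

/* Return a pointer to the first occurrence of c in s, or NULL if c does
 * not occur. As with strchr(), searching for 0 finds the terminator. */
static const ucs4_t *ucs4_strchr_sketch(const ucs4_t *s, ucs4_t c)
{
    for (;; s++) {
        if (*s == c)
            return s;
        if (*s == 0)
            return NULL;
    }
}
```

Both helpers count and compare whole code points rather than bytes, so a character outside the BMP such as U+1F600 occupies exactly one element. Like their ``char`` counterparts they stop at the first 0 element, which is the same truncation caveat the patch notes for ``wchar_t*`` strings containing null characters.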