summaryrefslogtreecommitdiffstats
path: root/Include/unicodeobject.h
Commit message (Collapse)AuthorAgeFilesLines
* _PyUnicode_IsWhitespace(),Tim Peters2005-10-291-2/+2
| | | | | | | | | _PyUnicode_IsLinebreak(): Changed the declarations to match the definitions. Don't know why they differed; MSVC warned about it; don't know why only these two functions use "const". Someone who does may want to do something saner ;-).
* SF bug #1251300: On UCS-4 builds the "unicode-internal" codec will now complainWalter Dörwald2005-08-301-0/+10
| | | | | about illegal code points. The codec now supports PEP 293 style error handlers. (This is a variant of the Nik Haldimann's patch that detects truncated data)
* Correct the handling of 0-termination of PyUnicode_AsWideChar()Marc-André Lemburg2004-11-221-2/+8
| | | | | | | | and its usage in PyLocale_strcoll(). Clarify the documentation on this. Thanks to Andreas Degert for pointing this out.
* SF patch #1056231: typo in comment (unicodeobject.h)Raymond Hettinger2004-10-311-1/+1
|
* SF patch #998993: The UTF-8 and the UTF-16 stateful decoders now supportWalter Dörwald2004-09-071-0/+21
| | | | | | | | | | | decoding incomplete input (when the input stream is temporarily exhausted). codecs.StreamReader now implements buffering, which enables proper readline support for the UTF-16 decoders. codecs.StreamReader.read() has a new argument chars which specifies the number of characters to return. codecs.StreamReader.readline() and codecs.StreamReader.readlines() have a new argument keepends. Trailing "\n"s will be stripped from the lines if keepends is false. Added C APIs PyUnicode_DecodeUTF8Stateful and PyUnicode_DecodeUTF16Stateful.
* SF #989185: Drop unicode.iswide() and unicode.width() and addHye-Shik Chang2004-08-041-18/+0
| | | | | | | | | | | | unicodedata.east_asian_width(). You can still implement your own simple width() function using it like this: def width(u): w = 0 for c in unicodedata.normalize('NFC', u): cwidth = unicodedata.east_asian_width(c) if cwidth in ('W', 'F'): w += 2 else: w += 1 return w
* Allow string and unicode return types from .encode()/.decode()Marc-André Lemburg2004-07-081-0/+11
| | | | | methods on string and unicode objects. Added unicode.decode() which was missing for no apparent reason.
* - SF #962502: Add two more methods for unicode type; width() andHye-Shik Chang2004-06-021-0/+18
| | | | | | | iswide() for east asian width manipulation. (Inspired by David Goodger, Reviewed by Martin v. Loewis) - Move _PyUnicode_TypeRecord.flags to the end of the struct so that no padding is added for UCS-4 builds. (Suggested by Martin v. Loewis)
* Add rsplit method for str and unicode builtin types.Hye-Shik Chang2003-12-151-0/+20
| | | | | SF feature request #801847. Original patch is written by Sean Reifschneider.
* Add name mangling for new PyUnicode_FromOrdinal() and fix declarationMarc-André Lemburg2002-08-121-1/+3
| | | | to use new extern macro.
* Excise DL_EXPORT from Include.Mark Hammond2002-08-121-72/+72
| | | | Thanks to Skip Montanaro and Kalle Svensson for the patches.
* Add C API PyUnicode_FromOrdinal() which exposes unichr() at C level.Marc-André Lemburg2002-08-111-0/+12
| | | | | | | u'%c' will now raise a ValueError in case the argument is an integer outside the valid range of Unicode code point ordinals. Closes SF bug #593581.
* Fix for bug [ 561796 ] string.find causes lazy errorMarc-André Lemburg2002-05-291-1/+2
|
* Apply patch diff.txt from SF feature requestWalter Dörwald2002-04-221-0/+7
| | | | | | | | | http://www.python.org/sf/444708 This adds the optional argument for str.strip to unicode.strip too and makes it possible to call str.strip with a unicode argument and unicode.strip with a str argument.
* SF patch #470578: Fixes to synchronize unicode() and str()Guido van Rossum2001-10-191-10/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch implements what we have discussed on python-dev late in September: str(obj) and unicode(obj) should behave similar, while the old behaviour is retained for unicode(obj, encoding, errors). The patch also adds a new feature with which objects can provide unicode(obj) with input data: the __unicode__ method. Currently no new tp_unicode slot is implemented; this is left as option for the future. Note that PyUnicode_FromEncodedObject() no longer accepts Unicode objects as input. The API name already suggests that Unicode objects do not belong in the list of acceptable objects and the functionality was only needed because PyUnicode_FromEncodedObject() was being used directly by unicode(). The latter was changed in the discussed way: * unicode(obj) calls PyObject_Unicode() * unicode(obj, encoding, errors) calls PyUnicode_FromEncodedObject() One thing left open to discussion is whether to leave the PyUnicode_FromObject() API as a thin API extension on top of PyUnicode_FromEncodedObject() or to turn it into a (macro) alias for PyObject_Unicode() and deprecate it. Doing so would have some surprising consequences though, e.g. u"abc" + 123 would turn out as u"abc123"... [Marc-Andre didn't have time to check this in before the deadline. I hope this is OK, Marc-Andre! You can still make changes and commit them on the trunk after the branch has been made, but then please mail Barry a context diff if you want the change to be merged into the 2.2b1 release branch. GvR]
* Patch #435971: UTF-7 codec by Brian Quinlan.Marc-André Lemburg2001-09-201-0/+18
|
* Fix for bug #462737.Marc-André Lemburg2001-09-191-3/+3
|
* Possibly the end of SF [#460020] bug or feature: unicode() and subclasses.Tim Peters2001-09-111-0/+2
| | | | | Changed unicode(i) to return a true Unicode object when i is an instance of a unicode subclass. Added PyUnicode_CheckExact macro.
* Make the Py<type>_Check() macro use PyObject_TypeCheck().Guido van Rossum2001-08-301-1/+1
|
* Patch #445762: Support --disable-unicodeMartin v. Löwis2001-08-171-0/+7
| | | | | | | | - Do not compile unicodeobject, unicodectype, and unicodedata if Unicode is disabled - check for Py_USING_UNICODE in all places that use Unicode functions - disables unicode literals, and the builtin functions - add the types.StringTypes list - remove Unicode literals from most tests.
* SF patch #438013 Remove 2-byte Py_UCS2 assumptionsTim Peters2001-08-091-6/+0
| | | | | | | | Removed all instances of Py_UCS2 from the codebase, and so also (I hope) the last remaining reliance on the platform having an integral type with exactly 16 bits. PyUnicode_DecodeUTF16() and PyUnicode_EncodeUTF16() now read and write one byte at a time.
* As discussed on python-dev: this patch adds name mangling toMarc-André Lemburg2001-07-311-0/+150
| | | | | assure that extensions and interpreters using the Unicode APIs were compiled using the same Unicode width.
* Add _PyUnicode_AsDefaultEncodedString to unicodeobject.h.Jeremy Hylton2001-07-301-0/+17
| | | | | | | And remove all the extern decls in the middle of .c files. Apparently, it was excluded from the header file because it is intended for internal use by the interpreter. It's still intended for internal use and documented as such in the header file.
* removed "register const" from scalar arguments to the unicodeFredrik Lundh2001-06-271-15/+15
| | | | predicates
* use Py_UNICODE_WIDE instead of USE_UCS4_STORAGE and Py_UNICODE_SIZEFredrik Lundh2001-06-271-6/+7
| | | | tests.
* Encode surrogates in UTF-8 even for a wide Py_UNICODE.Martin v. Löwis2001-06-271-0/+3
| | | | | | | Implement sys.maxunicode. Explicitly wrap around upper/lower computations for wide Py_UNICODE. When decoding large characters with UTF-8, represent expected test results using the \U notation.
* Make Unicode work a bit better on Windows...Fredrik Lundh2001-06-261-0/+8
|
* Support using UCS-4 as the Py_UNICODE type:Martin v. Löwis2001-06-261-30/+23
| | | | | | | | | | Add configure option --enable-unicode. Add config.h macros Py_USING_UNICODE, PY_UNICODE_TYPE, Py_UNICODE_SIZE, SIZEOF_WCHAR_T. Define Py_UCS2. Encode and decode large UTF-8 characters into single Py_UNICODE values for wide Unicode types; likewise for UTF-16. Remove test whether sizeof Py_UNICODE is two.
* experimental UCS-4 support: added USE_UCS4_STORAGE define toFredrik Lundh2001-06-261-12/+19
| | | | | | unicodeobject.h, which forces sizeof(Py_UNICODE) == sizeof(Py_UCS4). (this may be good enough for platforms that doesn't have a 16-bit type. the UTF-16 codecs don't work, though)
* This patch changes the behaviour of the UTF-16 codec family. Only theMarc-André Lemburg2001-05-211-4/+5
| | | | | | | | | UTF-16 codec will now interpret and remove a *leading* BOM mark. Sub- sequent BOM characters are no longer interpreted and removed. UTF-16-LE and -BE pass through all BOM mark characters. These changes should get the UTF-16 codec more in line with what the Unicode FAQ recommends w/r to BOM marks.
* This patch originated from an idea by Martin v. Loewis who submitted aMarc-André Lemburg2001-04-231-2/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | patch for sharing single character Unicode objects. Martin's patch had to be reworked in a number of ways to take Unicode resizing into consideration as well. Here's what the updated patch implements: * Single character Unicode strings in the Latin-1 range are shared (not only ASCII chars as in Martin's original patch). * The ASCII and Latin-1 codecs make use of this optimization, providing a noticable speedup for single character strings. Most Unicode methods can use the optimization as well (by virtue of using PyUnicode_FromUnicode()). * Some code cleanup was done (replacing memcpy with Py_UNICODE_COPY) * The PyUnicode_Resize() can now also handle the case of resizing unicode_empty which previously resulted in an error. * Modified the internal API _PyUnicode_Resize() and the public PyUnicode_Resize() API to handle references to shared objects correctly. The _PyUnicode_Resize() signature changed due to this. * Callers of PyUnicode_FromUnicode() may now only modify the Unicode object contents of the returned object in case they called the API with NULL as content template. Note that even though this patch passes the regression tests, there may still be subtle bugs in the sharing code.
* Added #fndef's to avoid compiler errors.Marc-André Lemburg2000-08-111-1/+3
|
* This patch finalizes the move from UTF-8 to a default encoding inMarc-André Lemburg2000-08-031-2/+3
| | | | | | | | | | | | | | | | | | the Python Unicode implementation. The internal buffer used for implementing the buffer protocol is renamed to defenc to make this change visible. It now holds the default encoded version of the Unicode object and is calculated on demand (NULL otherwise). Since the default encoding defaults to ASCII, this will mean that Unicode objects which hold non-ASCII characters will no longer work on C APIs using the "s" or "t" parser markers. C APIs must now explicitly provide Unicode support via the "u", "U" or "es"/"es#" parser markers in order to work with non-ASCII Unicode strings. (Note: this patch will also have to be applied to the 1.6 branch of the CVS tree.)
* Changing the CNRI copyright notice according to CNRI's instructions.Guido van Rossum2000-08-031-1/+1
| | | | | This is a notice without a date, which apparently is not a claim to copyright but only advice to the reader. IANAL. :-)
* ANSIfications: fix empty arglists, and remove the checks forThomas Wouters2000-07-221-1/+1
| | | | 'HAVE_STDARG_PROTOTYPES' (consider it true, remove false branch)
* Spelling fixes supplied by Rob W. W. Hooft. All these are fixes in eitherThomas Wouters2000-07-161-2/+2
| | | | | | | | | | comments, docstrings or error messages. I fixed two minor things in test_winreg.py ("didn't" -> "Didn't" and "Didnt" -> "Didn't"). There is a minor style issue involved: Guido seems to have preferred English grammar (behaviour, honour) in a couple places. This patch changes that to American, which is the more prominent style in the source. I prefer English myself, so if English is preferred, I'd be happy to supply a patch myself ;)
* Added new API PyUnicode_FromEncodedObject() which supports decodingMarc-André Lemburg2000-07-071-0/+18
| | | | | | objects including instance objects. The old API PyUnicode_FromObject() is still available as shortcut.
* Bill Tutt: Added Py_UCS4 typedef to hold UCS4 values (these needMarc-André Lemburg2000-07-071-0/+11
| | | | | at least 32 bits as opposed to Py_UNICODE which rely on having 16 bits).
* Modified the ISALPHA and ISALNUM macros to use the new lookup APIsMarc-André Lemburg2000-07-051-5/+8
| | | | from unicodectype.c
* Added new Py_UNICODE_ISALPHA() and Py_UNICODE_ISALNUM() macrosMarc-André Lemburg2000-07-031-0/+11
| | | | | | | | which are true for alphabetic and alphanumeric characters resp. The macros are currently implemented using the existing is* tables but will have to be updated to meet the Unicode standard definitions (add tables for non-cased letters and letter modifiers).
* Marc-Andre Lemburg <mal@lemburg.com>:Marc-André Lemburg2000-06-181-1/+2
| | | | | Added optimization proposed by Andrew Kuchling to the Unicode matching macro.
* M.-A. Lemburg <mal@lemburg.com>:Fred Drake2000-05-091-4/+26
| | | | | Added PyUnicode_GetDefaultEncoding() and PyUnicode_GetDefaultEncoding() APIs.
* Marc-Andre Lemburg:Guido van Rossum2000-04-111-1/+1
| | | | | | | Changed PyUnicode_Splitlines() maxsplit argument to keepends. The maxsplit functionality was replaced by the keepends functionality which allows keeping the line end markers together with the string.
* Marc-Andre Lemburg: New exported API PyUnicode_Resize().Guido van Rossum2000-04-101-0/+19
|
* Marc-Andre's third try at this bulk patch seems to work (except thatGuido van Rossum2000-04-051-2/+33
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | his copy of test_contains.py seems to be broken -- the lines he deleted were already absent). Checkin messages: New Unicode support for int(), float(), complex() and long(). - new APIs PyInt_FromUnicode() and PyLong_FromUnicode() - added support for Unicode to PyFloat_FromString() - new encoding API PyUnicode_EncodeDecimal() which converts Unicode to a decimal char* string (used in the above new APIs) - shortcuts for calls like int(<int object>) and float(<float obj>) - tests for all of the above Unicode compares and contains checks: - comparing Unicode and non-string types now works; TypeErrors are masked, all other errors such as ValueError during Unicode coercion are passed through (note that PyUnicode_Compare does not implement the masking -- PyObject_Compare does this) - contains now works for non-string types too; TypeErrors are masked and 0 returned; all other errors are passed through Better testing support for the standard codecs. Misc minor enhancements, such as an alias dbcs for the mbcs codec. Changes: - PyLong_FromString() now applies the same error checks as does PyInt_FromString(): trailing garbage is reported as error and not longer silently ignored. The only characters which may be trailing the digits are 'L' and 'l' -- these are still silently ignored. - string.ato?() now directly interface to int(), long() and float(). The error strings are now a little different, but the type still remains the same. These functions are now ready to get declared obsolete ;-) - PyNumber_Int() now also does a check for embedded NULL chars in the input string; PyNumber_Long() already did this (and still does) Followed by: Looks like I've gone a step too far there... (and test_contains.py seem to have a bug too). I've changed back to reporting all errors in PyUnicode_Contains() and added a few more test cases to test_contains.py (plus corrected the join() NameError).
* Marc-Andre Lemburg:Guido van Rossum2000-03-281-1/+7
| | | | | | | | | | | | | | | The attached patch set includes a workaround to get Python with Unicode compile on BSDI 4.x (courtesy Thomas Wouters; the cause is a bug in the BSDI wchar.h header file) and Python interfaces for the MBCS codec donated by Mark Hammond. Also included are some minor corrections w/r to the docs of the new "es" and "es#" parser markers (use PyMem_Free() instead of free(); thanks to Mark Hammond for finding these). The unicodedata tests are now in a separate file (test_unicodedata.py) to avoid problems if the module cannot be found.
* Prototypes added for MBCS codecs. (Win32 only.)Guido van Rossum2000-03-281-0/+20
|
* On 17-Mar-2000, Marc-Andre Lemburg said:Barry Warsaw2000-03-201-7/+9
| | | | | | | | | | | | | Attached you find an update of the Unicode implementation. The patch is against the current CVS version. I would appreciate if someone with CVS checkin permissions could check the changes in. The patch contains all bugs and patches sent this week and also fixes a leak in the codecs code and a bug in the free list code for Unicode objects (which only shows up when compiling Python with Py_DEBUG; thanks to MarkH for spotting this one).
* Marc-Andre Lemburg: add declaration for PyUnicode_Contains().Guido van Rossum2000-03-131-0/+11
|
* Unicode implementation by Marc-Andre Lemburg based on original code by ↵Guido van Rossum2000-03-101-0/+754
Fredrik Lundh.