cpython.git - https://github.com/python/cpython.git

	Commit message	Author	Age	Files	Lines
*	Additional test and documentation for the unicode() changes.	Marc-André Lemburg	2001-10-19	1	-2/+3
	This patch should also be applied to the 2.2b1 trunk.
*	SF patch #470578: Fixes to synchronize unicode() and str()	Guido van Rossum	2001-10-19	1	-46/+40
	This patch implements what we have discussed on python-dev late in September: str(obj) and unicode(obj) should behave similarly, while the old behaviour is retained for unicode(obj, encoding, errors).
	The patch also adds a new feature with which objects can provide unicode(obj) with input data: the __unicode__ method. Currently no new tp_unicode slot is implemented; this is left as an option for the future.
	Note that PyUnicode_FromEncodedObject() no longer accepts Unicode objects as input. The API name already suggests that Unicode objects do not belong in the list of acceptable objects and the functionality was only needed because PyUnicode_FromEncodedObject() was being used directly by unicode(). The latter was changed in the discussed way:
	* unicode(obj) calls PyObject_Unicode()
	* unicode(obj, encoding, errors) calls PyUnicode_FromEncodedObject()
	One thing left open to discussion is whether to leave the PyUnicode_FromObject() API as a thin API extension on top of PyUnicode_FromEncodedObject() or to turn it into a (macro) alias for PyObject_Unicode() and deprecate it. Doing so would have some surprising consequences though, e.g. u"abc" + 123 would turn out as u"abc123"...
	[Marc-Andre didn't have time to check this in before the deadline. I hope this is OK, Marc-Andre! You can still make changes and commit them on the trunk after the branch has been made, but then please mail Barry a context diff if you want the change to be merged into the 2.2b1 release branch. GvR]
*	Enable GC for new-style instances. This touches lots of files, since	Guido van Rossum	2001-10-05	1	-2/+7
	many types were subclassable but had a xxx_dealloc function that called PyObject_DEL(self) directly instead of deferring to self->ob_type->tp_free(self). It is permissible to set tp_free in the type object directly to _PyObject_Del, for non-GC types, or to _PyObject_GC_Del, for GC types. Still, PyObject_DEL was a tad faster, so I'm fearing that our pystone rating is going down again. I'm not sure if doing something like
	    void xxx_dealloc(PyObject *self)
	    {
	        if (PyXxxCheckExact(self))
	            PyObject_DEL(self);
	        else
	            self->ob_type->tp_free(self);
	    }
	is any faster than always calling the else branch, so I haven't attempted that -- however those types whose own dealloc is fancier (int, float, unicode) do use this pattern.
*	Fix a bug in rendering of \\ by repr() -- it rendered as \\\ instead	Guido van Rossum	2001-09-21	1	-0/+1
	of \\.
*	Fix Unicode .join() method to raise a TypeError for sequence	Marc-André Lemburg	2001-09-20	1	-1/+11
	elements which are not Unicode objects or strings. (This matches the string.join() behaviour.) Fix a memory leak in the .join() method which occurs in case the Unicode resize fails. Restore the test_unicode output.
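The TypeError behaviour described in this commit can still be observed today. The following is only a sketch and assumes a current Python 3 interpreter, where `str` plays the role of the old `unicode` type; the helper `join_or_error` is a hypothetical name introduced for illustration.

```python
# Assumption: Python 3, where str is the Unicode string type.
def join_or_error(parts):
    """Return the joined string, or the TypeError raised by str.join()."""
    try:
        return ", ".join(parts)
    except TypeError as exc:
        return exc


ok = join_or_error(["a", "b"])   # all elements are strings
bad = join_or_error(["a", 1])    # 1 is not a string, so join() refuses it
```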
*	Implement the changes proposed in patch #413333. unicode(obj) now	Marc-André Lemburg	2001-09-20	1	-42/+55
	works just like str(obj) in that it tries __str__/tp_str on the object in case it finds that the object is not a string or buffer.
*	Patch #435971: UTF-7 codec by Brian Quinlan.	Marc-André Lemburg	2001-09-20	1	-0/+300
*	str_subtype_new, unicode_subtype_new:	Tim Peters	2001-09-12	1	-10/+11
	+ These were leaving the hash fields at 0, which all string and unicode routines believe is a legitimate hash code. As a result, hash() applied to str and unicode subclass instances always returned 0, which in turn confused dict operations, etc.
	+ Changed local names "new"; no point to antagonizing C++ compilers.
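A small check of the property this fix restores, written as a sketch for Python 3 (an assumption; the commit itself targets the 2.2 `str` and `unicode` types). The subclass `Tagged` is hypothetical.

```python
class Tagged(str):
    """Hypothetical str subclass, used only for this illustration."""


plain = "spam"
tagged = Tagged("spam")

# A subclass instance must hash like the equal plain string,
# otherwise dict lookups with it would miss existing keys.
hashes_match = hash(tagged) == hash(plain)

table = {plain: 1}
found = table.get(tagged)
```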
*	More on bug 460020: disable many optimizations of unicode subclasses.	Tim Peters	2001-09-12	1	-10/+11
*	Possibly the end of SF [#460020] bug or feature: unicode() and subclasses.	Tim Peters	2001-09-11	1	-4/+12
	Changed unicode(i) to return a true Unicode object when i is an instance of a unicode subclass. Added PyUnicode_CheckExact macro.
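The same rule can be seen with `str` in Python 3, which is assumed here as the modern counterpart of `unicode`. This is a sketch only; `Label` is a hypothetical subclass.

```python
class Label(str):
    """Hypothetical str subclass, used only for this illustration."""


label = Label("x")

# Converting a subclass instance yields an exact str, not a Label.
converted = str(label)
```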
*	PyUnicode_FromEncodedObject(): Repair memory leak in an error case.	Tim Peters	2001-09-11	1	-2/+2
*	Make unicode subclassable.	Guido van Rossum	2001-08-30	1	-2/+32
*	Patch #427190: Implement and use METH_NOARGS and METH_O.	Martin v. Löwis	2001-08-16	1	-116/+60
*	SF patch #438013 Remove 2-byte Py_UCS2 assumptions	Tim Peters	2001-08-09	1	-76/+90
	Removed all instances of Py_UCS2 from the codebase, and so also (I hope) the last remaining reliance on the platform having an integral type with exactly 16 bits. PyUnicode_DecodeUTF16() and PyUnicode_EncodeUTF16() now read and write one byte at a time.
*	Merge of descr-branch back into trunk.	Tim Peters	2001-08-02	1	-9/+45
*	Add _PyUnicode_AsDefaultEncodedString to unicodeobject.h.	Jeremy Hylton	2001-07-30	1	-14/+0
	And remove all the extern decls in the middle of .c files. Apparently, it was excluded from the header file because it is intended for internal use by the interpreter. It's still intended for internal use and documented as such in the header file.
*	Fix for bug #444493: u'\U00010001' segfaults with current CVS on	Marc-André Lemburg	2001-07-25	1	-6/+21
	wide builds.
*	Make the unicode-escape and the UTF-16 codecs handle surrogates	Marc-André Lemburg	2001-07-20	1	-24/+46
	correctly and thus roundtrip-safe. Some minor cleanups of the code. Added tests for the roundtrip-safety.
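Round-trip safety for a character outside the Basic Multilingual Plane can be checked as follows. This is a sketch assuming Python 3 and its `utf-16-le` and `unicode_escape` codec names; it is not the test added by the commit.

```python
# U+10001 needs a surrogate pair in UTF-16: D800 DC01.
ch = "\U00010001"

encoded = ch.encode("utf-16-le")        # two 16-bit code units, little endian
decoded = encoded.decode("utf-16-le")   # the pair is merged back into one character

escaped = ch.encode("unicode_escape")   # written with the 8-digit \U notation
unescaped = escaped.decode("unicode_escape")
```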
*	#ifdef out generation of \U escapes unless Py_UNICODE_WIDE. This	Guido van Rossum	2001-07-20	1	-0/+2
	caused warnings with the VMS C compiler. (SF bug #442998, in part.) On a narrow system the current code should never be executed since ch will always be < 0x10000. Marc-Andre: you may end up fixing this a different way, since I believe you have plans to generate \U for surrogate pairs. I'll leave that to you.
*	use Py_UNICODE_WIDE instead of USE_UCS4_STORAGE and Py_UNICODE_SIZE	Fredrik Lundh	2001-06-27	1	-4/+4
	tests.
*	Encode surrogates in UTF-8 even for a wide Py_UNICODE.	Martin v. Löwis	2001-06-27	1	-7/+12
	Implement sys.maxunicode. Explicitly wrap around upper/lower computations for wide Py_UNICODE. When decoding large characters with UTF-8, represent expected test results using the \U notation.
*	When decoding UTF-16, don't assume that the buffer is in native endianness	Martin v. Löwis	2001-06-26	1	-4/+4
	when checking surrogates.
*	Support using UCS-4 as the Py_UNICODE type:	Martin v. Löwis	2001-06-26	1	-30/+89
	Add configure option --enable-unicode. Add config.h macros Py_USING_UNICODE, PY_UNICODE_TYPE, Py_UNICODE_SIZE, SIZEOF_WCHAR_T. Define Py_UCS2. Encode and decode large UTF-8 characters into single Py_UNICODE values for wide Unicode types; likewise for UTF-16. Remove test whether sizeof Py_UNICODE is two.
*	experimental UCS-4 support: added USE_UCS4_STORAGE define to	Fredrik Lundh	2001-06-26	1	-0/+2
	unicodeobject.h, which forces sizeof(Py_UNICODE) == sizeof(Py_UCS4). (this may be good enough for platforms that don't have a 16-bit type. the UTF-16 codecs don't work, though)
*	experimental UCS-4 support: made compare a bit more robust, in case	Fredrik Lundh	2001-06-26	1	-11/+14
	sizeof(Py_UNICODE) >= sizeof(long). also changed surrogate expansion to work if sizeof(Py_UNICODE) > 2.
*	experimental UCS-4 support: don't assume that MS_WIN32 implies	Fredrik Lundh	2001-06-26	1	-2/+2
	HAVE_USABLE_WCHAR_T
*	Fix a mis-indentation in _PyUnicode_New() that caused me to stare at	Guido van Rossum	2001-06-14	1	-3/+3
	some code for longer than needed.
*	Fixes [ #430986 ] Buglet in PyUnicode_FromUnicode.	Marc-André Lemburg	2001-06-07	1	-1/+1
*	fix bogus indentation	Jeremy Hylton	2001-05-29	1	-1/+1
*	This patch changes the behaviour of the UTF-16 codec family. Only the	Marc-André Lemburg	2001-05-21	1	-17/+25
	UTF-16 codec will now interpret and remove a leading BOM mark. Subsequent BOM characters are no longer interpreted and removed. UTF-16-LE and -BE pass through all BOM mark characters. These changes should get the UTF-16 codec more in line with what the Unicode FAQ recommends w/r to BOM marks.
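The BOM handling described in this commit can be illustrated with the codecs shipped in Python 3 (an assumption; the commit changed the 2.x C implementation). The sketch builds little-endian input by hand using `codecs.BOM_UTF16_LE`.

```python
import codecs

payload = "a".encode("utf-16-le")
data = codecs.BOM_UTF16_LE + payload

# The generic UTF-16 codec interprets and removes the leading BOM.
with_bom_removed = data.decode("utf-16")

# The endian-specific codec passes the BOM through as U+FEFF.
with_bom_kept = data.decode("utf-16-le")

# Only the first BOM is consumed; a second one stays in the text.
second_bom_kept = (codecs.BOM_UTF16_LE + data).decode("utf-16")
```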
*	Remove unused variable	Jeremy Hylton	2001-05-08	1	-1/+0
*	Make unicode.join() work nice with iterators. This also required a change	Tim Peters	2001-05-05	1	-11/+15
	to string.join(), so that when the latter figures out in midstream that it really needs unicode.join() instead, unicode.join() can actually get all the sequence elements (i.e., there's no guarantee that the sequence passed to string.join() can be iterated over again by unicode.join(), so string.join() must not pass on the original sequence object anymore).
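A one-shot iterator is the case this commit cares about, since it cannot be walked a second time. The sketch below assumes Python 3; `letters` is a hypothetical generator used only for illustration.

```python
def letters():
    """Hypothetical one-shot generator, used only for this illustration."""
    yield "a"
    yield "b"
    yield "c"


# join() consumes the iterator exactly once and still sees every element.
joined_from_generator = "-".join(letters())
joined_from_iterator = "-".join(iter(["x", "y"]))
```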
*	A different approach to the problem reported in	Tim Peters	2001-04-28	1	-4/+15
	Patch #419651: Metrowerks on Mac adds 0x itself.
	C std says %#x and %#X conversion of 0 do not add the 0x/0X base marker. Metrowerks apparently does. Mark Favas reported the same bug under a Compaq compiler on Tru64 Unix, but no other libc broken in this respect is known (known to be OK under MSVC and gcc). So just try the damn thing at runtime and see what the platform does. Note that we've always had bugs here, but never knew it before because a relevant test case didn't exist before 2.1.
*	This patch originated from an idea by Martin v. Loewis who submitted a	Marc-André Lemburg	2001-04-23	1	-51/+133
	patch for sharing single character Unicode objects. Martin's patch had to be reworked in a number of ways to take Unicode resizing into consideration as well. Here's what the updated patch implements:
	* Single character Unicode strings in the Latin-1 range are shared (not only ASCII chars as in Martin's original patch).
	* The ASCII and Latin-1 codecs make use of this optimization, providing a noticeable speedup for single character strings. Most Unicode methods can use the optimization as well (by virtue of using PyUnicode_FromUnicode()).
	* Some code cleanup was done (replacing memcpy with Py_UNICODE_COPY)
	* The PyUnicode_Resize() can now also handle the case of resizing unicode_empty which previously resulted in an error.
	* Modified the internal API _PyUnicode_Resize() and the public PyUnicode_Resize() API to handle references to shared objects correctly. The _PyUnicode_Resize() signature changed due to this.
	* Callers of PyUnicode_FromUnicode() may now only modify the Unicode object contents of the returned object in case they called the API with NULL as content template.
	Note that even though this patch passes the regression tests, there may still be subtle bugs in the sharing code.
*	SF bug #417587: compiler warnings compiling 2.1.	Tim Peters	2001-04-21	1	-3/+0
	Repaired some of the SGI compiler warnings Sjoerd Mullender reported.
*	CVS patch 416248: 2.1c1 unicodeobject: unused vrbl cleanup, from Mark Favas.	Tim Peters	2001-04-19	1	-2/+0
*	Revert previous checkin, which caused test_unicodedata to fail.	Jeremy Hylton	2001-04-19	1	-33/+0
*	Patch #416953: Cache ASCII characters to speed up ASCII decoding.	Martin v. Löwis	2001-04-18	1	-0/+33
*	Bug 415514 reported that e.g.	Tim Peters	2001-04-12	1	-13/+19
	"%#x" % 0 blew up, at heart because C sprintf supplies a base marker if and only if the value is not 0. I then fixed that, by tolerating C's inconsistency when it does %#x, and taking away that Python produced 0x0 when formatting 0L (the "long" flavor of 0) under %#x itself. But after talking with Guido, we agreed it would be better to supply 0x for the short int case too, despite that it's inconsistent with C, because C is inconsistent with itself and with Python's hex(0) (plus, while "%#x" % 0 didn't work before, "%#x" % 0L did, and returned "0x0"). Similarly for %#X conversion.
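The outcome agreed on in this commit, a base marker even for zero, is what Python 3 still does. A short sketch, assuming Python 3 (which has a single integer type, so the `0L` case no longer exists):

```python
# The alternate form (#) always adds the base marker, including for 0.
zero_lower = "%#x" % 0
zero_upper = "%#X" % 0
nonzero = "%#x" % 255

# This keeps %-formatting consistent with hex().
same_as_hex = zero_lower == hex(0)
```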
*	Fix for SF bug #415514: "%#x" % 0 caused assertion failure/abort.	Tim Peters	2001-04-12	1	-11/+12
	http://sourceforge.net/tracker/index.php?func=detail&aid=415514&group_id=5470&atid=105470
	For short ints, Python defers to the platform C library to figure out what %#x should do. The code asserted that the platform C returned a string beginning with "0x". However, that's not true when-- and only when --the value being formatted is 0. Changed the code to live with C's inconsistency here. In the meantime, the problem does not arise if you format a long 0 (0L) instead. However, that's because the code we wrote to do %#x conversions on longs produces a leading "0x" regardless of value. That's probably wrong too: we should drop leading "0x", for consistency with C, when (& only when) formatting 0L. So I changed the long formatting code to do that too.
*	reorganized PyUnicode_DecodeUnicodeEscape a bit (in order to make it	Fredrik Lundh	2001-02-18	1	-110/+69
	less likely that bug #132817 ever appears again)
*	Fixed .capitalize() method of Unicode objects to work like the	Marc-André Lemburg	2001-01-29	1	-4/+18
	corresponding string method. Added tests for this too. Patch written by Marc-Andre Lemburg. Copyright assigned to Guido van Rossum.
*	Show '\011', '\012', and '\015' as '\t', '\n', '\r' in strings.	Ka-Ping Yee	2001-01-24	1	-6/+19
	Switch from octal escapes to hex escapes for other nonprintable characters.
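The escape style introduced here is still what `repr()` produces. A minimal sketch, assuming Python 3:

```python
# Tab, newline and carriage return, plus one other control character.
text = "\t\n\r\x01"

# repr() uses the short escapes for the first three and a hex escape
# (not an octal one) for the remaining nonprintable character.
shown = repr(text)
```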
*	Move ucnhash functionality into unicodedata (after the recent	Fredrik Lundh	2001-01-24	1	-9/+11
	crop of changes, the files are small enough to do this). Also adds "name" and "lookup" functions to unicodedata.
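The two functions added by this commit are still part of the `unicodedata` module. A brief sketch of how they mirror each other, assuming Python 3:

```python
import unicodedata

# name() maps a character to its Unicode character name ...
char_name = unicodedata.name("a")

# ... and lookup() maps the name back to the character.
char = unicodedata.lookup("LATIN SMALL LETTER A")
```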
*	Better error message if ucnhash cannot be found (obscure attribute	Fredrik Lundh	2001-01-20	1	-3/+8
	errors aren't that helpful), or doesn't contain what's expected from it. Also tweaked the test script so it compiles even if ucnhash is missing.
*	refactored the unicodeobject/ucnhash interface, to hide the	Fredrik Lundh	2001-01-19	1	-103/+39
	implementation details inside the ucnhash module. also cleaned up the unicode copyright blurb a little; Secret Labs' internal revision history isn't that interesting...
*	This patch adds a new builtin unistr() which behaves like str()	Marc-André Lemburg	2001-01-17	1	-0/+1
	except that it always returns Unicode objects. A new C API PyObject_Unicode() is also provided. This closes patch #101664. Written by Marc-Andre Lemburg. Copyright assigned to Guido van Rossum.
*	Added checks to prevent PyUnicode_Count() from dumping core	Marc-André Lemburg	2001-01-16	1	-8/+19
	in case the parameters are out of bounds and fixes error handling for .count(), .startswith() and .endswith() for the case of mixed string/Unicode objects. This patch adds Python style index semantics to PyUnicode_Count() indices (including the special handling of negative indices). The patch is an extended version of patch #103249 submitted by Michael Hudson (mwh) on SF. It also includes new test cases.
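At the Python level, the index semantics described here look as follows. This is a sketch assuming Python 3 string methods; it does not call PyUnicode_Count() directly.

```python
text = "abcabc"

# A negative start counts from the end: only "abc" is searched.
tail_count = text.count("a", -3)

# Out-of-bounds indices are clipped instead of crashing.
out_of_range = text.count("a", 10, 20)

# The same slice-style indices apply to startswith().
tail_starts = text.startswith("abc", -3)
```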
*	This patch adds a new feature to the builtin charmap codec:	Marc-André Lemburg	2001-01-06	1	-8/+48
	The mapping dictionaries can now contain 1-n mappings, meaning that character ordinals may be mapped to strings or Unicode objects, e.g. 0x0078 ('x') -> u"abc", causing the ordinal to be replaced by the complete string or Unicode object instead of just one character. Another feature introduced by the patch is that of mapping ordinals to the empty string. This allows removing characters. The patch is different from patch #103100 in that it does not cause a performance hit for the normal use case of 1-1 mappings. Written by Marc-Andre Lemburg, copyright assigned to Guido van Rossum.
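A 1-n mapping and a removal mapping can be tried out with `codecs.charmap_decode()`. This is a sketch assuming Python 3, where the decoding map takes byte values as keys; the dictionary `decoding_map` is made up for illustration.

```python
import codecs

# Hypothetical decoding map:
#   0x78 ('x') expands to three characters (1-n mapping),
#   0x79 ('y') maps to the empty string and is therefore removed,
#   0x7A ('z') is an ordinary 1-1 mapping given as an ordinal.
decoding_map = {0x78: "abc", 0x79: "", 0x7A: 0x7A}

decoded, consumed = codecs.charmap_decode(b"xyz", "strict", decoding_map)
```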
*	This patch changes the default behaviour of the builtin charmap	Marc-André Lemburg	2001-01-03	1	-13/+8
	codec to not apply Latin-1 mappings for keys which are not found in the mapping dictionaries, but instead treat them as undefined mappings. The patch was originally written by Martin v. Loewis with some additional (cosmetic) changes and an updated test script by Marc-Andre Lemburg. The standard codecs were recreated from the most current files available at the Unicode.org site using the Tools/scripts/gencodec.py tool. This patch closes the bugs #116285 and #119960.
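The effect of treating missing keys as undefined can be sketched with `codecs.charmap_decode()` in Python 3 (an assumption; the commit changed the 2.x codec). The map and the helper `decode_or_error` are hypothetical.

```python
import codecs

# Hypothetical decoding map that defines a single byte value.
decoding_map = {0x41: "A"}


def decode_or_error(data):
    """Decode with the map above, or return the error that was raised."""
    try:
        return codecs.charmap_decode(data, "strict", decoding_map)[0]
    except UnicodeDecodeError as exc:
        return exc


defined = decode_or_error(b"A")
# 0xE9 is not in the map: it is undefined, not silently read as Latin-1.
undefined = decode_or_error(b"\xe9")
```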