path: root/Modules/unicodedata.c
Commit message  Author  Age  Files  Lines
* gh-101372: Fix unicodedata.is_normalized to properly handle the UCD 3… (gh-101388)  Miss Islington (bot)  2023-02-06  1  -1/+1
|   (cherry picked from commit 9ef7e75434587fc8f167d73eee5dd9bdca62714b)
|   Co-authored-by: Dong-hee Na <donghee.na@python.org>
* bpo-43908: Make heap types converted during 3.10 alpha immutable (GH-26351) (GH-26766)  Miss Islington (bot)  2021-06-17  1  -1/+1
|   * Make functools types immutable
|   * Multibyte codec types are now immutable
|   * pyexpat.xmlparser is now immutable
|   * array.arrayiterator is now immutable
|   * _thread types are now immutable
|   * _csv types are now immutable
|   * _queue.SimpleQueue is now immutable
|   * mmap.mmap is now immutable
|   * unicodedata.UCD is now immutable
|   * sqlite3 types are now immutable
|   * _lsprof.Profiler is now immutable
|   * _overlapped.Overlapped is now immutable
|   * _operator types are now immutable
|   * winapi__overlapped.Overlapped is now immutable
|   * _lzma types are now immutable
|   * _bz2 types are now immutable
|   * _dbm.dbm and _gdbm.gdbm are now immutable
|   (cherry picked from commit 00710e6346fd2394aa020b2dfae170093effac98)
|   Co-authored-by: Erlend Egeberg Aasland <erlend.aasland@innova.no>
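The effect of making `unicodedata.UCD` immutable can be observed from Python. The following is a minimal sketch, assuming a CPython build where the type is either a static type or an immutable heap type; the attribute name `extra_attribute` is hypothetical and used only for illustration.

```python
import unicodedata

# The UCD type is not exported by name; recover it from the ucd_3_2_0 instance.
UCD = type(unicodedata.ucd_3_2_0)

try:
    # Hypothetical attribute name, chosen only to probe mutability.
    UCD.extra_attribute = 1
except TypeError as exc:
    immutable = True
    print("UCD type rejected the assignment:", exc)
else:
    immutable = False
    del UCD.extra_attribute  # undo the change if the interpreter allowed it

print("immutable:", immutable)
```

Immutability matters here because a mutable type object would let any code change behaviour shared by every user of the module.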
* bpo-42972: Fully support GC for pyexpat, unicodedata, and dbm/gdbm heap types (GH-26376)  Miss Islington (bot)  2021-05-27  1  -3/+14
|   * bpo-42972: pyexpat
|   * bpo-42972: unicodedata
|   * bpo-42972: dbm/gdbm
|   (cherry picked from commit 59af59c2dfa52dcd5605185263f266a49ced934c)
|   Co-authored-by: Erlend Egeberg Aasland <erlend.aasland@innova.no>
* bpo-43916: Apply Py_TPFLAGS_DISALLOW_INSTANTIATION to selected types (GH-25748)  Erlend Egeberg Aasland  2021-04-30  1  -1/+1
|   Apply Py_TPFLAGS_DISALLOW_INSTANTIATION to the following types:
|   * _dbm.dbm
|   * _gdbm.gdbm
|   * _multibytecodec.MultibyteCodec
|   * _sre.SRE_Scanner
|   * _thread._localdummy
|   * _thread.lock
|   * _winapi.Overlapped
|   * array.arrayiterator
|   * functools.KeyWrapper
|   * functools._lru_list_elem
|   * pyexpat.xmlparser
|   * re.Match
|   * re.Pattern
|   * unicodedata.UCD
|   * zlib.Compress
|   * zlib.Decompress
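From the Python side, a type that disallows instantiation raises `TypeError` when called. The sketch below assumes CPython and checks this for `unicodedata.UCD`; instances are still available through the module's own `ucd_3_2_0` object.

```python
import unicodedata

# Recover the UCD type from the only instance the module exposes.
UCD = type(unicodedata.ucd_3_2_0)

try:
    UCD()  # direct construction is expected to be refused
except TypeError as exc:
    blocked = True
    print("cannot instantiate:", exc)
else:
    blocked = False

# The pre-built instance remains usable.
print(unicodedata.ucd_3_2_0.unidata_version)
```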
* bpo-41798: Allocate unicodedata CAPI on the heap (GH-24128)  Erlend Egeberg Aasland  2021-01-20  1  -8/+29
|
* bpo-42519: Replace PyObject_MALLOC() with PyObject_Malloc() (GH-23587)  Victor Stinner  2020-12-01  1  -1/+1
|   No longer use deprecated aliases to functions:
|   * Replace PyObject_MALLOC() with PyObject_Malloc()
|   * Replace PyObject_REALLOC() with PyObject_Realloc()
|   * Replace PyObject_FREE() with PyObject_Free()
|   * Replace PyObject_Del() with PyObject_Free()
|   * Replace PyObject_DEL() with PyObject_Free()
* bpo-42157: Rename unicodedata.ucnhash_CAPI (GH-22994)  Victor Stinner  2020-10-27  1  -2/+2
|   Removed the unicodedata.ucnhash_CAPI attribute which was an internal
|   PyCapsule object. The related private _PyUnicode_Name_CAPI structure
|   was moved to the internal C API.
|
|   Rename unicodedata.ucnhash_CAPI as unicodedata._ucnhash_CAPI.
* bpo-42157: Convert unicodedata.UCD to heap type (GH-22991)  Victor Stinner  2020-10-26  1  -76/+44
|   Convert the unicodedata extension module to the multiphase
|   initialization API (PEP 489) and convert the unicodedata.UCD static
|   type to a heap type.
|
|   Co-Authored-By: Mohamed Koubaa <koubaa.m@gmail.com>
* bpo-42157: unicodedata avoids references to UCD_Type (GH-22990)  Victor Stinner  2020-10-26  1  -105/+111
|   * UCD_Check() uses PyModule_Check()
|   * Simplify the internal _PyUnicode_Name_CAPI structure:
|     * Remove size and state members
|     * Remove state and self parameters of getcode() and getname() functions
|   * Remove global_module_state
* bpo-1635741: _PyUnicode_Name_CAPI moves to internal C API (GH-22713)  Victor Stinner  2020-10-26  1  -13/+15
|   The private _PyUnicode_Name_CAPI structure of the PyCapsule API
|   unicodedata.ucnhash_CAPI moves to the internal C API. Moreover, the
|   structure gets a new state member which must be passed to the
|   getcode() and getname() functions.
|
|   * Move Include/ucnhash.h to Include/internal/pycore_ucnhash.h
|   * unicodedata module is now built with Py_BUILD_CORE_MODULE.
|   * unicodedata: move hashAPI variable into unicodedata_module_state.
* bpo-1635741: Add a global module state to unicodedata (GH-22712)  Victor Stinner  2020-10-15  1  -54/+107
|   Prepare unicodedata to add a state per module: start with a global
|   "module" state, pass it to subfunctions which access &UCD_Type. This
|   change also prepares the conversion of the UCD_Type static type to a
|   heap type.
* bpo-1635741, unicodedata: add ucd_type parameter to UCD_Check() macro (GH-22328)  Mohamed Koubaa  2020-09-23  1  -13/+16
|   Co-authored-by: Victor Stinner <vstinner@python.org>
* bpo-40268: Remove unused structmember.h includes (GH-19530)  Victor Stinner  2020-04-15  1  -1/+1
|   If only offsetof() is needed: include stddef.h instead.
|
|   When structmember.h is used, add a comment explaining that
|   PyMemberDef is used.
* bpo-39943: Add the const qualifier to pointers on non-mutable PyUnicode data. (GH-19345)  Serhiy Storchaka  2020-04-11  1  -3/+3
|
* bpo-39943: Remove unused self from find_nfc_index() (GH-18973)  Andy Lester  2020-03-17  1  -4/+4
|
* closes bpo-39926: Update Unicode to 13.0.0. (GH-18910)  Benjamin Peterson  2020-03-11  1  -4/+5
|
* bpo-39573: Clean up modules and headers to use Py_IS_TYPE() function (GH-18521)  Dong-hee Na  2020-02-17  1  -1/+1
|
* bpo-39573: Add Py_SET_TYPE() function (GH-18394)  Victor Stinner  2020-02-07  1  -1/+1
|   Add Py_SET_TYPE() function to set the type of an object.
* bpo-37752: Delete redundant Py_CHARMASK in normalizestring() (GH-15095)  Jordon Xu  2019-09-10  1  -2/+2
|
* bpo-38043: Use `bool` for boolean flags on is_normalized_quickcheck. (GH-15711)  Greg Price  2019-09-09  1  -11/+11
|
* closes bpo-37966: Fully implement the UAX #15 quick-check algorithm. (GH-15558)  Greg Price  2019-09-04  1  -24/+51
|   The purpose of the `unicodedata.is_normalized` function is to answer
|   the question `str == unicodedata.normalize(form, str)` more
|   efficiently than writing just that, by using the "quick check"
|   optimization described in the Unicode standard in UAX #15.
|
|   However, it turns out the code doesn't implement the full algorithm
|   from the standard, and as a result we often miss the optimization and
|   end up having to compute the whole normalized string after all.
|
|   Implement the standard's algorithm. This greatly speeds up
|   `unicodedata.is_normalized` in many cases where our partial variant
|   of quick-check had been returning MAYBE and the standard algorithm
|   returns NO.
|
|   At a quick test on my desktop, the existing code takes about
|   4.4 ms/MB (so 4.4 ns per byte) when the partial quick-check returns
|   MAYBE and it has to do the slow normalize-and-compare:
|
|     $ build.base/python -m timeit -s 'import unicodedata; s = "\uf900"*500000' \
|         -- 'unicodedata.is_normalized("NFD", s)'
|     50 loops, best of 5: 4.39 msec per loop
|
|   With this patch, it gets the answer instantly (58 ns) on the same
|   1 MB string:
|
|     $ build.dev/python -m timeit -s 'import unicodedata; s = "\uf900"*500000' \
|         -- 'unicodedata.is_normalized("NFD", s)'
|     5000000 loops, best of 5: 58.2 nsec per loop
|
|   This restores a small optimization that the original version of this
|   code had for the `unicodedata.normalize` use case.
|
|   With this, that case is actually faster than in master!
|
|     $ build.base/python -m timeit -s 'import unicodedata; s = "\u0338"*500000' \
|         -- 'unicodedata.normalize("NFD", s)'
|     500 loops, best of 5: 561 usec per loop
|
|     $ build.dev/python -m timeit -s 'import unicodedata; s = "\u0338"*500000' \
|         -- 'unicodedata.normalize("NFD", s)'
|     500 loops, best of 5: 512 usec per loop
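The quick-check idea from this commit can be sketched in pure Python for the NFD form, where the answer is always YES or NO (never MAYBE). This is an illustrative sketch, not the C implementation in unicodedata.c: the helper names `has_canonical_decomposition` and `nfd_quick_check` are hypothetical, and the sketch assumes that a character is "NFD quick-check NO" exactly when it has a canonical decomposition, with precomposed Hangul syllables handled by range because they decompose algorithmically.

```python
import unicodedata

# Precomposed Hangul syllables decompose algorithmically, so they do not
# appear in the per-character decomposition data and are matched by range.
S_BASE, S_COUNT = 0xAC00, 11172


def has_canonical_decomposition(ch):
    """Return True if ch would change under canonical decomposition."""
    if S_BASE <= ord(ch) < S_BASE + S_COUNT:
        return True
    mapping = unicodedata.decomposition(ch)
    # Compatibility mappings are tagged, e.g. "<compat> 0020"; skip those.
    return bool(mapping) and not mapping.startswith("<")


def nfd_quick_check(text):
    """Single pass over text: True if it is already in NFD, else False."""
    last_ccc = 0
    for ch in text:
        ccc = unicodedata.combining(ch)
        # Combining marks out of canonical order mean "not normalized".
        if ccc != 0 and last_ccc > ccc:
            return False
        if has_canonical_decomposition(ch):
            return False
        last_ccc = ccc
    return True


samples = ["abc", "\u00e9", "e\u0301", "\uf900" * 1000,
           "\u0338" * 1000, "a\u0315\u0300", "\uac00"]
for s in samples:
    # The quick check should agree with the slow normalize-and-compare.
    assert nfd_quick_check(s) == (s == unicodedata.normalize("NFD", s))
```

The loop stops at the first offending character, which is why a string such as `"\uf900" * 500000` can be rejected without building its normalized form.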
* bpo-36974: tp_print -> tp_vectorcall_offset and tp_reserved -> tp_as_async (GH-13464)  Jeroen Demeyer  2019-05-31  1  -2/+2
|   Automatically replace
|   tp_print -> tp_vectorcall_offset
|   tp_compare -> tp_as_async
|   tp_reserved -> tp_as_async
* bpo-36642: make unicodedata const (GH-12855)  Inada Naoki  2019-04-16  1  -1/+1
|
* closes bpo-32285: Add unicodedata.is_normalized. (GH-4806)  Max Bélanger  2018-11-04  1  -17/+98
|
* bpo-29456: Fix bugs in unicodedata.normalize: u1176, u11a7 and u11c3 (GH-1958)  Wonsup Yoon  2018-06-15  1  -3/+7
|   The Hangul composition boundary checks were wrong. The second
|   character must lie in [0x1161, 0x1176), not [0x1161, 0x1176], and the
|   third character must lie in (0x11A7, 0x11C3), not [0x11A7, 0x11C3].
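The boundaries in this fix follow from the Hangul composition arithmetic in the Unicode standard. The sketch below is a standalone Python illustration, not the code in unicodedata.c; the function name `compose_hangul` is hypothetical, while the constants are the standard's base and count values.

```python
# Constants from the Unicode standard's Hangul syllable composition.
L_BASE, V_BASE, T_BASE, S_BASE = 0x1100, 0x1161, 0x11A7, 0xAC00
L_COUNT, V_COUNT, T_COUNT = 19, 21, 28


def compose_hangul(lead, vowel, trail=None):
    """Compose jamo code points into one syllable, or return None."""
    if not (L_BASE <= lead < L_BASE + L_COUNT):
        return None
    # Vowel range is half-open: [0x1161, 0x1176), so 0x1176 is excluded.
    if not (V_BASE <= vowel < V_BASE + V_COUNT):
        return None
    syllable = S_BASE + ((lead - L_BASE) * V_COUNT + (vowel - V_BASE)) * T_COUNT
    if trail is None:
        return syllable
    # Trailing range is open at both ends: (0x11A7, 0x11C3).
    # T_BASE itself stands for "no trailing consonant".
    if not (T_BASE < trail < T_BASE + T_COUNT):
        return None
    return syllable + (trail - T_BASE)


print(hex(compose_hangul(0x1100, 0x1161)))          # lead + vowel
print(hex(compose_hangul(0x1100, 0x1161, 0x11A8)))  # lead + vowel + trail
print(compose_hangul(0x1100, 0x1176))               # vowel out of range
```

Treating either excluded endpoint as valid would produce a code point belonging to a different syllable, which is the class of bug the commit title names.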
* update to Unicode 11.0.0 (closes bpo-33778) (GH-7439)  Benjamin Peterson  2018-06-07  1  -1/+1
|   Also, standardize indentation of generated tables.
* Fix miscellaneous typos (#4275)  luzpaz  2017-11-05  1  -1/+1
|
* bpo-30736: upgrade to Unicode 10.0 (#2344)  Benjamin Peterson  2017-06-23  1  -2/+3
|   Straightforward. While we're at it, though, strip trailing whitespace
|   from generated tables.
* Issue #28511: Use the "U" format instead of "O!" in PyArg_Parse*.  Serhiy Storchaka  2016-10-23  1  -5/+2
|
* Add an extra byte for null in case we ever get very long unicode names.  Christian Heimes  2016-09-23  1  -4/+4
|\
| * Add an extra byte for null in case we ever get very long unicode names.  Christian Heimes  2016-09-23  1  -4/+4
| |
* | Unicode 9.0.0  Benjamin Peterson  2016-09-15  1  -0/+3
| |   Not completely mechanical since support for East Asian Width changes
| |   (emoji codepoints became Wide) had to be added to unicodedata.
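The East Asian Width property mentioned here is exposed through `unicodedata.east_asian_width`. A small sketch, assuming an interpreter whose Unicode database is version 9.0.0 or later; the sample characters are chosen only for illustration.

```python
import unicodedata

# One narrow Latin letter, one CJK ideograph, one emoji.
samples = ("A", "\u4e2d", "\U0001F600")

widths = {ch: unicodedata.east_asian_width(ch) for ch in samples}
for ch, width in widths.items():
    print(f"U+{ord(ch):04X} -> {width}")
```

Terminal and layout code typically treats the `W` and `F` classes as occupying two columns, which is why the emoji change was more than a data refresh.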
* | Restrict name_length to NAME_MAXLEN in unicodedata_UCD_lookup()  Christian Heimes  2016-09-14  1  -1/+1
|\ \
| |/
| * Restrict name_length to NAME_MAXLEN in unicodedata_UCD_lookup()  Christian Heimes  2016-09-14  1  -1/+1
| |
* | Issue #25923: Added the const qualifier to static constant arrays.  Serhiy Storchaka  2015-12-25  1  -2/+2
|/
* upgrade to Unicode 8.0.0  Benjamin Peterson  2015-06-27  1  -2/+3
|
* Issue #24000: Improved Argument Clinic's mapping of converters to legacy  Larry Hastings  2015-05-08  1  -2/+2
|   "format units". Updated the documentation to match.
* Issue #24001: Argument Clinic converters now use accept={type}  Larry Hastings  2015-05-04  1  -22/+22
|   instead of types={'type'} to specify the types the converter accepts.
* Issue #20181: Converted the unicodedata module to Argument Clinic.  Serhiy Storchaka  2015-04-17  1  -227/+196
|
* Issue #23944: Argument Clinic now wraps long impl prototypes at column 78.  Larry Hastings  2015-04-14  1  -2/+4
|
* Issue #23501: Argument Clinic now generates code into separate files by default.  Serhiy Storchaka  2015-04-03  1  -34/+3
|
* merge 3.3 (#23367)  Benjamin Peterson  2015-03-02  1  -3/+10
|\
| * fix possible overflow bugs in unicodedata (closes #23367)  Benjamin Peterson  2015-03-02  1  -3/+10
| |
* | Issue #23446: Use PyMem_New instead of PyMem_Malloc to avoid possible integer  Serhiy Storchaka  2015-02-16  1  -2/+2
| |   overflows. Added a few missing PyErr_NoMemory() calls.
* | Issue #23181: More "codepoint" -> "code point".  Serhiy Storchaka  2015-01-18  1  -7/+7
| |
* | Closes #21780: make the unicodedata module "ssize_t clean" for parsing parameters  Victor Stinner  2014-07-01  1  -2/+8
| |
* | Issue #20530: Argument Clinic's signature format has been revised again.  Larry Hastings  2014-02-09  1  -2/+4
| |   The new syntax is highly human readable while still preventing false
| |   positives. The syntax also extends Python syntax to denote "self" and
| |   positional-only parameters, allowing inspect.Signature objects to be
| |   totally accurate for all supported builtins in Python 3.4.
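The introspection this enables can be tried on `unicodedata.normalize`. The sketch below assumes a CPython build in which the function carries a text signature that `inspect.signature` can parse; the exact parameter names depend on the interpreter version, so the example inspects them rather than hard-coding them.

```python
import inspect
import unicodedata

# inspect.signature() relies on the text signature embedded in the docstring
# by Argument Clinic; the "self" marker is skipped for module-level functions.
sig = inspect.signature(unicodedata.normalize)
params = list(sig.parameters.values())

for p in params:
    print(p.name, p.kind)

all_positional_only = all(
    p.kind is inspect.Parameter.POSITIONAL_ONLY for p in params
)
print("positional-only:", all_positional_only)
```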
* | Issue #20326: Argument Clinic now uses a simple, unique signature to  Larry Hastings  2014-01-28  1  -3/+3
| |   annotate text signatures in docstrings, resulting in fewer false
| |   positives. "self" parameters are also explicitly marked, allowing
| |   inspect.Signature() to authoritatively detect (and skip) said
| |   parameters.
| |
| |   Issue #20326: Argument Clinic now generates separate checksums for
| |   the input and output sections of the block, allowing external tools
| |   to verify that the input has not changed (and thus the output is not
| |   out-of-date).
* | Issue #20390: Small fixes and improvements for Argument Clinic.  Larry Hastings  2014-01-26  1  -7/+6
| |
* | Issue #20189: Four additional builtin types (PyTypeObject,  Larry Hastings  2014-01-24  1  -2/+2
| |   PyMethodDescr_Type, _PyMethodWrapper_Type, and PyWrapperDescr_Type)
| |   have been modified to provide introspection information for builtins.
| |   Also: many additional Lib, test suite, and Argument Clinic fixes.