path: root/Modules/unicodedata.c
Commit message [Author, Age, Files, Lines]
* closes bpo-37966: Fully implement the UAX #15 quick-check algorithm. (GH-15558) [Miss Islington (bot), 2019-09-04, 1, -24/+51]
|
|   The purpose of the `unicodedata.is_normalized` function is to answer the question `str == unicodedata.normalize(form, str)` more efficiently than writing just that, by using the "quick check" optimization described in the Unicode standard in UAX #15.
|
|   However, it turns out the code doesn't implement the full algorithm from the standard, and as a result we often miss the optimization and end up having to compute the whole normalized string after all.
|
|   Implement the standard's algorithm. This greatly speeds up `unicodedata.is_normalized` in many cases where our partial variant of quick-check had been returning MAYBE and the standard algorithm returns NO.
|
|   At a quick test on my desktop, the existing code takes about 4.4 ms/MB (so 4.4 ns per byte) when the partial quick-check returns MAYBE and it has to do the slow normalize-and-compare:
|
|       $ build.base/python -m timeit -s 'import unicodedata; s = "\uf900"*500000' \
|           -- 'unicodedata.is_normalized("NFD", s)'
|       50 loops, best of 5: 4.39 msec per loop
|
|   With this patch, it gets the answer instantly (58 ns) on the same 1 MB string:
|
|       $ build.dev/python -m timeit -s 'import unicodedata; s = "\uf900"*500000' \
|           -- 'unicodedata.is_normalized("NFD", s)'
|       5000000 loops, best of 5: 58.2 nsec per loop
|
|   This restores a small optimization that the original version of this code had for the `unicodedata.normalize` use case.
|
|   With this, that case is actually faster than in master!
|
|       $ build.base/python -m timeit -s 'import unicodedata; s = "\u0338"*500000' \
|           -- 'unicodedata.normalize("NFD", s)'
|       500 loops, best of 5: 561 usec per loop
|
|       $ build.dev/python -m timeit -s 'import unicodedata; s = "\u0338"*500000' \
|           -- 'unicodedata.normalize("NFD", s)'
|       500 loops, best of 5: 512 usec per loop
|
|   (cherry picked from commit 2f09413947d1ce0043de62ed2346f9a2b4e5880b)
|
|   Co-authored-by: Greg Price <gnprice@gmail.com>
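The equivalence that this commit message describes can be illustrated from the Python side. The following is only a minimal sketch, assuming Python 3.8 or later (where `unicodedata.is_normalized` exists); the helper name `is_normalized_slow` is hypothetical and is not part of the module.

```python
import unicodedata


def is_normalized_slow(form: str, s: str) -> bool:
    """Reference definition (hypothetical helper): normalize everything, then compare."""
    return s == unicodedata.normalize(form, s)


# U+F900 has a canonical decomposition, so a string made of it is not in NFD.
# This is the same character the timing runs in the commit message use.
s = "\uf900" * 1000

fast = unicodedata.is_normalized("NFD", s)   # may stop early via quick check
slow = is_normalized_slow("NFD", s)          # always builds the normalized copy

# Both routes are expected to agree; only their cost differs.
print(fast, slow)
```

The two calls answer the same question; the commit only changes how often the built-in can answer without building the normalized string.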
* bpo-36974: tp_print -> tp_vectorcall_offset and tp_reserved -> tp_as_async (GH-13464) [Jeroen Demeyer, 2019-05-31, 1, -2/+2]
|
|   Automatically replace
|       tp_print -> tp_vectorcall_offset
|       tp_compare -> tp_as_async
|       tp_reserved -> tp_as_async
* bpo-36642: make unicodedata const (GH-12855) [Inada Naoki, 2019-04-16, 1, -1/+1]
|
* closes bpo-32285: Add unicodedata.is_normalized. (GH-4806) [Max Bélanger, 2018-11-04, 1, -17/+98]
|
* bpo-29456: Fix bugs in unicodedata.normalize: u1176, u11a7 and u11c3 (GH-1958) [Wonsup Yoon, 2018-06-15, 1, -3/+7]
|
|   Hangul composition check boundaries were wrong for the second character (the range should be [0x1161, 0x1176), not [0x1161, 0x1176]) and for the third character (the range should be (0x11A7, 0x11C3), not [0x11A7, 0x11C3]).
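The corrected boundaries can be sketched in Python. This is an illustrative reimplementation of the Hangul composition arithmetic from the Unicode standard, not the C code touched by the commit; the function name `compose_hangul` and the constant names are assumptions made for this example.

```python
import unicodedata
from typing import Optional

# Constants from the Unicode standard's Hangul syllable composition algorithm.
S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
L_COUNT, V_COUNT, T_COUNT = 19, 21, 28


def compose_hangul(l: int, v: int, t: Optional[int] = None) -> Optional[int]:
    """Compose leading/vowel/(optional) trailing jamo, or return None (hypothetical helper)."""
    if not (L_BASE <= l < L_BASE + L_COUNT):
        return None
    # Second character: half-open range [0x1161, 0x1176), so U+1176 is rejected.
    if not (V_BASE <= v < V_BASE + V_COUNT):
        return None
    syllable = S_BASE + ((l - L_BASE) * V_COUNT + (v - V_BASE)) * T_COUNT
    if t is not None:
        # Third character: open range (0x11A7, 0x11C3), so both ends are rejected.
        if not (T_BASE < t < T_BASE + T_COUNT):
            return None
        syllable += t - T_BASE
    return syllable


# Cross-check one case against the built-in normalizer.
composed = unicodedata.normalize("NFC", "\u1100\u1161\u11a8")
print(hex(compose_hangul(0x1100, 0x1161, 0x11A8)), hex(ord(composed)))
```

The three code points named in the commit title (U+1176, U+11A7, U+11C3) are exactly the ones that the exclusive bounds keep out.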
* update to Unicode 11.0.0 (closes bpo-33778) (GH-7439) [Benjamin Peterson, 2018-06-07, 1, -1/+1]
|
|   Also, standardize indentation of generated tables.
* Fix miscellaneous typos (#4275) [luzpaz, 2017-11-05, 1, -1/+1]
|
* bpo-30736: upgrade to Unicode 10.0 (#2344) [Benjamin Peterson, 2017-06-23, 1, -2/+3]
|
|   Straightforward. While we're at it, though, strip trailing whitespace from generated tables.
* Issue #28511: Use the "U" format instead of "O!" in PyArg_Parse*. [Serhiy Storchaka, 2016-10-23, 1, -5/+2]
|
* Add an extra byte for null in case we ever get very long unicode names. [Christian Heimes, 2016-09-23, 1, -4/+4]
|\
| * Add an extra byte for null in case we ever get very long unicode names. [Christian Heimes, 2016-09-23, 1, -4/+4]
| |
* | Unicode 9.0.0 [Benjamin Peterson, 2016-09-15, 1, -0/+3]
| |
| |   Not completely mechanical since support for East Asian Width changes (emoji codepoints became Wide) had to be added to unicodedata.
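The East Asian Width property mentioned here is exposed through `unicodedata.east_asian_width`. The snippet below is a rough sketch of how a caller might use it; `display_width` is a hypothetical helper, and it ignores combining characters and other zero-width cases.

```python
import unicodedata

emoji = "\U0001F600"  # GRINNING FACE, classified as Wide since Unicode 9.0

width_class = unicodedata.east_asian_width(emoji)


def display_width(s: str) -> int:
    """Approximate terminal cell count (hypothetical helper, simplified)."""
    # Wide (W) and Fullwidth (F) characters take two cells; everything else
    # is counted as one, which is only an approximation.
    return sum(2 if unicodedata.east_asian_width(c) in ("W", "F") else 1 for c in s)


print(width_class, display_width("a" + emoji))
```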
* | Restrict name_length to NAME_MAXLEN in unicodedata_UCD_lookup() [Christian Heimes, 2016-09-14, 1, -1/+1]
|\ \
| |/
| * Restrict name_length to NAME_MAXLEN in unicodedata_UCD_lookup() [Christian Heimes, 2016-09-14, 1, -1/+1]
| |
* | Issue #25923: Added the const qualifier to static constant arrays. [Serhiy Storchaka, 2015-12-25, 1, -2/+2]
|/
* upgrade to Unicode 8.0.0 [Benjamin Peterson, 2015-06-27, 1, -2/+3]
|
* Issue #24000: Improved Argument Clinic's mapping of converters to legacy "format units". Updated the documentation to match. [Larry Hastings, 2015-05-08, 1, -2/+2]
|
* Issue #24001: Argument Clinic converters now use accept={type} instead of types={'type'} to specify the types the converter accepts. [Larry Hastings, 2015-05-04, 1, -22/+22]
|
* Issue #20181: Converted the unicodedata module to Argument Clinic. [Serhiy Storchaka, 2015-04-17, 1, -227/+196]
|
* Issue #23944: Argument Clinic now wraps long impl prototypes at column 78. [Larry Hastings, 2015-04-14, 1, -2/+4]
|
* Issue #23501: Argument Clinic now generates code into separate files by default. [Serhiy Storchaka, 2015-04-03, 1, -34/+3]
|
* merge 3.3 (#23367) [Benjamin Peterson, 2015-03-02, 1, -3/+10]
|\
| * fix possible overflow bugs in unicodedata (closes #23367) [Benjamin Peterson, 2015-03-02, 1, -3/+10]
| |
* | Issue #23446: Use PyMem_New instead of PyMem_Malloc to avoid possible integer overflows. Added a few missed PyErr_NoMemory() calls. [Serhiy Storchaka, 2015-02-16, 1, -2/+2]
| |
* | Issue #23181: More "codepoint" -> "code point". [Serhiy Storchaka, 2015-01-18, 1, -7/+7]
| |
* | Closes #21780: make the unicodedata module "ssize_t clean" for parsing parameters [Victor Stinner, 2014-07-01, 1, -2/+8]
| |
* | Issue #20530: Argument Clinic's signature format has been revised again. [Larry Hastings, 2014-02-09, 1, -2/+4]
| |
| |   The new syntax is highly human readable while still preventing false positives. The syntax also extends Python syntax to denote "self" and positional-only parameters, allowing inspect.Signature objects to be totally accurate for all supported builtins in Python 3.4.
* | Issue #20326: Argument Clinic now uses a simple, unique signature to annotate text signatures in docstrings, resulting in fewer false positives. "self" parameters are also explicitly marked, allowing inspect.Signature() to authoritatively detect (and skip) said parameters. [Larry Hastings, 2014-01-28, 1, -3/+3]
| |
| |   Issue #20326: Argument Clinic now generates separate checksums for the input and output sections of the block, allowing external tools to verify that the input has not changed (and thus the output is not out-of-date).
* | Issue #20390: Small fixes and improvements for Argument Clinic. [Larry Hastings, 2014-01-26, 1, -7/+6]
| |
* | Issue #20189: Four additional builtin types (PyTypeObject, PyMethodDescr_Type, _PyMethodWrapper_Type, and PyWrapperDescr_Type) have been modified to provide introspection information for builtins. Also: many additional Lib, test suite, and Argument Clinic fixes. [Larry Hastings, 2014-01-24, 1, -2/+2]
| |
* | Issue #19273: The marker comments Argument Clinic uses have been changed to improve readability. [Larry Hastings, 2014-01-07, 1, -6/+6]
| |
* | Issue #20141: Improved Argument Clinic's support for the PyArg_Parse "O!" format unit. [Larry Hastings, 2014-01-07, 1, -5/+5]
| |
* | Issue #19674: inspect.signature() now produces a correct signature for some builtins. [Larry Hastings, 2013-11-23, 1, -5/+9]
| |
* | Argument Clinic: rename "self" to "module" for module-level functions. [Larry Hastings, 2013-11-18, 1, -11/+12]
| |
* | Issue #16612: Add "Argument Clinic", a compile-time preprocessor for C files to generate argument parsing code. (See PEP 436.) [Larry Hastings, 2013-10-19, 1, -13/+51]
| |
* | merge 3.3 [Benjamin Peterson, 2013-10-11, 1, -1/+1]
|\ \
| |/
| * replace hardcoded version [Benjamin Peterson, 2013-10-11, 1, -1/+1]
| |
* | merge 3.3 [Benjamin Peterson, 2013-10-11, 1, -1/+1]
|\ \
| |/
| * make sure the docstring is never out of date wrt unicode data version [Benjamin Peterson, 2013-10-11, 1, -1/+1]
| |
* | merge 3.3 (#19220) [Benjamin Peterson, 2013-10-10, 1, -3/+1]
|\ \
| |/
| * remove url from docstring (closes #19220) [Benjamin Peterson, 2013-10-10, 1, -2/+1]
| |
* | upgrade unicode db to 6.3.0 (closes #19221) [Benjamin Peterson, 2013-10-10, 1, -2/+2]
|/
* #18803: fix more typos. Patch by Févry Thibault. [Ezio Melotti, 2013-08-25, 1, -1/+1]
|
* #18466: fix more typos. Patch by Févry Thibault. [Ezio Melotti, 2013-08-17, 1, -1/+1]
|
* #16681: merge with 3.2. [Ezio Melotti, 2012-12-14, 1, -1/+1]
|\
| * #16681: use "bidirectional class" instead of "bidirectional category" in the docstring too. [Ezio Melotti, 2012-12-14, 1, -1/+1]
| |
* | Use C-style comments (required for the AIX build slave). [Stefan Krah, 2012-09-23, 1, -2/+2]
| |
* | Issue #14909: A number of places were using PyMem_Realloc() APIs and PyObject_GC_Resize() with incorrect error handling. In case of errors, the original object would be leaked. This checkin fixes those cases. [Kristjan Valur Jonsson, 2012-05-31, 1, -2/+5]
| |
* | update to Unicode 6.1 [Benjamin Peterson, 2012-02-21, 1, -1/+1]
| |
* | #13379: merge with 3.2. [Ezio Melotti, 2011-11-10, 1, -5/+6]
|\ \
| |/