Commit message | Author | Age | Files | Lines | |
---|---|---|---|---|---|
* | closes bpo-37966: Fully implement the UAX #15 quick-check algorithm. (GH-15558) | Miss Islington (bot) | 2019-09-04 | 1 | -24/+51 |
| | The purpose of the `unicodedata.is_normalized` function is to answer the question `str == unicodedata.normalize(form, str)` more efficiently than writing just that, by using the "quick check" optimization described in the Unicode standard in UAX #15. However, it turns out the code doesn't implement the full algorithm from the standard, and as a result we often miss the optimization and end up having to compute the whole normalized string after all. Implement the standard's algorithm. This greatly speeds up `unicodedata.is_normalized` in many cases where our partial variant of quick-check had been returning MAYBE and the standard algorithm returns NO. At a quick test on my desktop, the existing code takes about 4.4 ms/MB (so 4.4 ns per byte) when the partial quick-check returns MAYBE and it has to do the slow normalize-and-compare: `$ build.base/python -m timeit -s 'import unicodedata; s = "\uf900"*500000' -- 'unicodedata.is_normalized("NFD", s)'` gives "50 loops, best of 5: 4.39 msec per loop". With this patch, it gets the answer instantly (58 ns) on the same 1 MB string: `$ build.dev/python -m timeit -s 'import unicodedata; s = "\uf900"*500000' -- 'unicodedata.is_normalized("NFD", s)'` gives "5000000 loops, best of 5: 58.2 nsec per loop". This restores a small optimization that the original version of this code had for the `unicodedata.normalize` use case. With this, that case is actually faster than in master! `$ build.base/python -m timeit -s 'import unicodedata; s = "\u0338"*500000' -- 'unicodedata.normalize("NFD", s)'` gives "500 loops, best of 5: 561 usec per loop", while `$ build.dev/python -m timeit -s 'import unicodedata; s = "\u0338"*500000' -- 'unicodedata.normalize("NFD", s)'` gives "500 loops, best of 5: 512 usec per loop". (cherry picked from commit 2f09413947d1ce0043de62ed2346f9a2b4e5880b) Co-authored-by: Greg Price <gnprice@gmail.com> | ||||
* | bpo-36974: tp_print -> tp_vectorcall_offset and tp_reserved -> tp_as_async (GH-13464) | Jeroen Demeyer | 2019-05-31 | 1 | -2/+2 |
| | Automatically replace: tp_print -> tp_vectorcall_offset; tp_compare -> tp_as_async; tp_reserved -> tp_as_async | ||||
* | bpo-36642: make unicodedata const (GH-12855) | Inada Naoki | 2019-04-16 | 1 | -1/+1 |
| | |||||
* | closes bpo-32285: Add unicodedata.is_normalized. (GH-4806) | Max Bélanger | 2018-11-04 | 1 | -17/+98 |
| | |||||
* | bpo-29456: Fix bugs in unicodedata.normalize: u1176, u11a7 and u11c3 (GH-1958) | Wonsup Yoon | 2018-06-15 | 1 | -3/+7 |
| | | | | | The Hangul composition check used the wrong boundaries: the second character must lie in [0x1161, 0x1176), not [0x1161, 0x1176], and the third character in (0x11A7, 0x11C3), not [0x11A7, 0x11C3]. | ||||
* | update to Unicode 11.0.0 (closes bpo-33778) (GH-7439) | Benjamin Peterson | 2018-06-07 | 1 | -1/+1 |
| | | | Also, standardize indentation of generated tables. | ||||
* | Fix miscellaneous typos (#4275) | luzpaz | 2017-11-05 | 1 | -1/+1 |
| | |||||
* | bpo-30736: upgrade to Unicode 10.0 (#2344) | Benjamin Peterson | 2017-06-23 | 1 | -2/+3 |
| | | | Straightforward. While we're at it, though, strip trailing whitespace from generated tables. | ||||
* | Issue #28511: Use the "U" format instead of "O!" in PyArg_Parse*. | Serhiy Storchaka | 2016-10-23 | 1 | -5/+2 |
| | |||||
* | Add an extra byte for null in case we ever get very long unicode names. | Christian Heimes | 2016-09-23 | 1 | -4/+4 |
|\ | |||||
| * | Add an extra byte for null in case we ever get very long unicode names. | Christian Heimes | 2016-09-23 | 1 | -4/+4 |
| | | |||||
* | | Unicode 9.0.0 | Benjamin Peterson | 2016-09-15 | 1 | -0/+3 |
| | | | | | | | | | | Not completely mechanical since support for East Asian Width changes—emoji codepoints became Wide—had to be added to unicodedata. | ||||
* | | Restrict name_length to NAME_MAXLEN in unicodedata_UCD_lookup() | Christian Heimes | 2016-09-14 | 1 | -1/+1 |
|\ \ | |/ | |||||
| * | Restrict name_length to NAME_MAXLEN in unicodedata_UCD_lookup() | Christian Heimes | 2016-09-14 | 1 | -1/+1 |
| | | |||||
* | | Issue #25923: Added the const qualifier to static constant arrays. | Serhiy Storchaka | 2015-12-25 | 1 | -2/+2 |
|/ | |||||
* | upgrade to Unicode 8.0.0 | Benjamin Peterson | 2015-06-27 | 1 | -2/+3 |
| | |||||
* | Issue #24000: Improved Argument Clinic's mapping of converters to legacy | Larry Hastings | 2015-05-08 | 1 | -2/+2 |
| | | | | "format units". Updated the documentation to match. | ||||
* | Issue #24001: Argument Clinic converters now use accept={type} | Larry Hastings | 2015-05-04 | 1 | -22/+22 |
| | | | | instead of types={'type'} to specify the types the converter accepts. | ||||
* | Issue #20181: Converted the unicodedata module to Argument Clinic. | Serhiy Storchaka | 2015-04-17 | 1 | -227/+196 |
| | |||||
* | Issue #23944: Argument Clinic now wraps long impl prototypes at column 78. | Larry Hastings | 2015-04-14 | 1 | -2/+4 |
| | |||||
* | Issue #23501: Argument Clinic now generates code into separate files by default. | Serhiy Storchaka | 2015-04-03 | 1 | -34/+3 |
| | |||||
* | merge 3.3 (#23367) | Benjamin Peterson | 2015-03-02 | 1 | -3/+10 |
|\ | |||||
| * | fix possible overflow bugs in unicodedata (closes #23367) | Benjamin Peterson | 2015-03-02 | 1 | -3/+10 |
| | | |||||
* | | Issue #23446: Use PyMem_New instead of PyMem_Malloc to avoid possible integer | Serhiy Storchaka | 2015-02-16 | 1 | -2/+2 |
| | | | | | | | | overflows. Added a few missing PyErr_NoMemory() calls. | ||||
* | | Issue #23181: More "codepoint" -> "code point". | Serhiy Storchaka | 2015-01-18 | 1 | -7/+7 |
| | | |||||
* | | Closes #21780: make the unicodedata module "ssize_t clean" for parsing parameters | Victor Stinner | 2014-07-01 | 1 | -2/+8 |
| | | |||||
* | | Issue #20530: Argument Clinic's signature format has been revised again. | Larry Hastings | 2014-02-09 | 1 | -2/+4 |
| | | | | | | | | | | | | | | The new syntax is highly human readable while still preventing false positives. The syntax also extends Python syntax to denote "self" and positional-only parameters, allowing inspect.Signature objects to be totally accurate for all supported builtins in Python 3.4. | ||||
* | | Issue #20326: Argument Clinic now uses a simple, unique signature to | Larry Hastings | 2014-01-28 | 1 | -3/+3 |
| | | | | | | | | | | | | | | | | | | | | annotate text signatures in docstrings, resulting in fewer false positives. "self" parameters are also explicitly marked, allowing inspect.Signature() to authoritatively detect (and skip) said parameters. Issue #20326: Argument Clinic now generates separate checksums for the input and output sections of the block, allowing external tools to verify that the input has not changed (and thus the output is not out-of-date). | ||||
* | | Issue #20390: Small fixes and improvements for Argument Clinic. | Larry Hastings | 2014-01-26 | 1 | -7/+6 |
| | | |||||
* | | Issue #20189: Four additional builtin types (PyTypeObject, | Larry Hastings | 2014-01-24 | 1 | -2/+2 |
| | | | | | | | | | | | | PyMethodDescr_Type, _PyMethodWrapper_Type, and PyWrapperDescr_Type) have been modified to provide introspection information for builtins. Also: many additional Lib, test suite, and Argument Clinic fixes. | ||||
* | | Issue #19273: The marker comments Argument Clinic uses have been changed | Larry Hastings | 2014-01-07 | 1 | -6/+6 |
| | | | | | | | | to improve readability. | ||||
* | | Issue #20141: Improved Argument Clinic's support for the PyArg_Parse "O!" | Larry Hastings | 2014-01-07 | 1 | -5/+5 |
| | | | | | | | | format unit. | ||||
* | | Issue #19674: inspect.signature() now produces a correct signature | Larry Hastings | 2013-11-23 | 1 | -5/+9 |
| | | | | | | | | for some builtins. | ||||
* | | Argument Clinic: rename "self" to "module" for module-level functions. | Larry Hastings | 2013-11-18 | 1 | -11/+12 |
| | | |||||
* | | Issue #16612: Add "Argument Clinic", a compile-time preprocessor | Larry Hastings | 2013-10-19 | 1 | -13/+51 |
| | | | | | | | | for C files to generate argument parsing code. (See PEP 436.) | ||||
* | | merge 3.3 | Benjamin Peterson | 2013-10-11 | 1 | -1/+1 |
|\ \ | |/ | |||||
| * | replace hardcoded version | Benjamin Peterson | 2013-10-11 | 1 | -1/+1 |
| | | |||||
* | | merge 3.3 | Benjamin Peterson | 2013-10-11 | 1 | -1/+1 |
|\ \ | |/ | |||||
| * | make sure the docstring is never out of date wrt unicode data version | Benjamin Peterson | 2013-10-11 | 1 | -1/+1 |
| | | |||||
* | | merge 3.3 (#19220) | Benjamin Peterson | 2013-10-10 | 1 | -3/+1 |
|\ \ | |/ | |||||
| * | remove url from docstring (closes #19220) | Benjamin Peterson | 2013-10-10 | 1 | -2/+1 |
| | | |||||
* | | upgrade unicode db to 6.3.0 (closes #19221) | Benjamin Peterson | 2013-10-10 | 1 | -2/+2 |
|/ | |||||
* | #18803: fix more typos. Patch by Févry Thibault. | Ezio Melotti | 2013-08-25 | 1 | -1/+1 |
| | |||||
* | #18466: fix more typos. Patch by Févry Thibault. | Ezio Melotti | 2013-08-17 | 1 | -1/+1 |
| | |||||
* | #16681: merge with 3.2. | Ezio Melotti | 2012-12-14 | 1 | -1/+1 |
|\ | |||||
| * | #16681: use "bidirectional class" instead of "bidirectional category" in the docstring too. | Ezio Melotti | 2012-12-14 | 1 | -1/+1 |
| | | |||||
* | | Use C-style comments (required for the AIX build slave). | Stefan Krah | 2012-09-23 | 1 | -2/+2 |
| | | |||||
* | | Issue #14909: A number of places were using PyMem_Realloc() APIs and | Kristjan Valur Jonsson | 2012-05-31 | 1 | -2/+5 |
| | | | | | | | | | | PyObject_GC_Resize() with incorrect error handling. In case of errors, the original object would be leaked. This checkin fixes those cases. | ||||
* | | update to Unicode 6.1 | Benjamin Peterson | 2012-02-21 | 1 | -1/+1 |
| | | |||||
* | | #13379: merge with 3.2. | Ezio Melotti | 2011-11-10 | 1 | -5/+6 |
|\ \ | |/ |