author | Greg Price <gnprice@gmail.com> | 2019-09-04 02:45:44 (GMT) |
---|---|---|
committer | Benjamin Peterson <benjamin@python.org> | 2019-09-04 02:45:44 (GMT) |
commit | 2f09413947d1ce0043de62ed2346f9a2b4e5880b (patch) | |
tree | 24a1f3b3e19d89925abd013531dc0f253606ed11 /Doc/whatsnew/3.8.rst | |
parent | 580bdb0ece681537eadb360f0c796123ead7a559 (diff) | |
download | cpython-2f09413947d1ce0043de62ed2346f9a2b4e5880b.zip cpython-2f09413947d1ce0043de62ed2346f9a2b4e5880b.tar.gz cpython-2f09413947d1ce0043de62ed2346f9a2b4e5880b.tar.bz2 |
closes bpo-37966: Fully implement the UAX #15 quick-check algorithm. (GH-15558)
The purpose of the `unicodedata.is_normalized` function is to answer
the question `str == unicodedata.normalize(form, str)` more
efficiently than writing just that, by using the "quick check"
optimization described in the Unicode standard in UAX #15.
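
As an illustration of the equivalence described above, here is a minimal sketch using only the standard `unicodedata` module (Python 3.8 or later). The helper name `is_normalized_slow` and the sample strings are hypothetical and not part of the patch:

```python
import unicodedata


def is_normalized_slow(form, s):
    """Naive check: build the fully normalized string, then compare."""
    return s == unicodedata.normalize(form, s)


# Sample strings chosen for illustration only.
samples = ["abc", "\uf900", "e\u0301", "\u00e9"]

# The fast function should always agree with the naive comparison.
for form in ("NFC", "NFD", "NFKC", "NFKD"):
    for s in samples:
        assert unicodedata.is_normalized(form, s) == is_normalized_slow(form, s)
```

The two functions are expected to agree on every input; the only intended difference is how much work is done to reach the answer.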
However, it turns out the code doesn't implement the full algorithm
from the standard, and as a result we often miss the optimization and
end up having to compute the whole normalized string after all.
Implement the standard's algorithm. This greatly speeds up
`unicodedata.is_normalized` in many cases where our partial variant
of quick-check had been returning MAYBE and the standard algorithm
returns NO.
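
For readers unfamiliar with quick-check, the following is a rough Python sketch of the UAX #15 loop, specialized to NFD. It is not the C code touched by this patch. It assumes that `NFD_Quick_Check=No` holds exactly for characters with a canonical decomposition (including precomposed Hangul syllables), and approximates that property with `unicodedata.decomposition`, since the module does not expose the quick-check property directly. The names `quick_check_nfd` and `_nfd_qc_is_no` are hypothetical:

```python
import unicodedata

YES, NO = "YES", "NO"


def _nfd_qc_is_no(ch):
    """Approximate NFD_Quick_Check=No (assumption, see the lead-in)."""
    # Precomposed Hangul syllables decompose algorithmically.
    if 0xAC00 <= ord(ch) <= 0xD7A3:
        return True
    decomp = unicodedata.decomposition(ch)
    # A leading "<tag>" marks a compatibility mapping, which NFD ignores.
    return bool(decomp) and not decomp.startswith("<")


def quick_check_nfd(s):
    """Return YES or NO for NFD, stopping at the first offending character."""
    last_ccc = 0
    for ch in s:
        ccc = unicodedata.combining(ch)
        # Combining marks out of canonical order: definitely not normalized.
        if ccc != 0 and last_ccc > ccc:
            return NO
        # A character that can never occur in NFD text.
        if _nfd_qc_is_no(ch):
            return NO
        last_ccc = ccc
    return YES
```

For NFD the property takes only the values Yes and No, so this sketch never returns MAYBE. For NFC and NFKC the property can also be Maybe, and a MAYBE result still requires the slow normalize-and-compare path. The key point is the early `return NO`: a single offending character near the start settles the answer without scanning or normalizing the rest of the string.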
At a quick test on my desktop, the existing code takes about 4.4 ms/MB
(so 4.4 ns per byte) when the partial quick-check returns MAYBE and it
has to do the slow normalize-and-compare:
$ build.base/python -m timeit -s 'import unicodedata; s = "\uf900"*500000' \
-- 'unicodedata.is_normalized("NFD", s)'
50 loops, best of 5: 4.39 msec per loop
With this patch, it gets the answer instantly (58 ns) on the same 1 MB
string:
$ build.dev/python -m timeit -s 'import unicodedata; s = "\uf900"*500000' \
-- 'unicodedata.is_normalized("NFD", s)'
5000000 loops, best of 5: 58.2 nsec per loop
The patch also restores a small optimization that the original version
of this code had for the `unicodedata.normalize` use case.
With it, that case is actually faster than in master!
$ build.base/python -m timeit -s 'import unicodedata; s = "\u0338"*500000' \
-- 'unicodedata.normalize("NFD", s)'
500 loops, best of 5: 561 usec per loop
$ build.dev/python -m timeit -s 'import unicodedata; s = "\u0338"*500000' \
-- 'unicodedata.normalize("NFD", s)'
500 loops, best of 5: 512 usec per loop
Diffstat (limited to 'Doc/whatsnew/3.8.rst')
-rw-r--r-- | Doc/whatsnew/3.8.rst | 5 |
1 files changed, 3 insertions, 2 deletions
diff --git a/Doc/whatsnew/3.8.rst b/Doc/whatsnew/3.8.rst
index bcdb60d..4a1362d 100644
--- a/Doc/whatsnew/3.8.rst
+++ b/Doc/whatsnew/3.8.rst
@@ -1090,8 +1090,9 @@ unicodedata
   <http://blog.unicode.org/2019/05/unicode-12-1-en.html>`_ release.
 
 * New function :func:`~unicodedata.is_normalized` can be used to verify a string
-  is in a specific normal form. (Contributed by Max Belanger and David Euresti in
-  :issue:`32285`).
+  is in a specific normal form, often much faster than by actually normalizing
+  the string. (Contributed by Max Belanger, David Euresti, and Greg Price in
+  :issue:`32285` and :issue:`37966`).
 
 unittest