author    Miss Islington (bot) <31488909+miss-islington@users.noreply.github.com>  2019-09-04 03:03:37 (GMT)
committer GitHub <noreply@github.com>  2019-09-04 03:03:37 (GMT)
commit    4dd1c9d9c2bca4744c70c9556b7051f4465ede3e (patch)
tree      3d6401fa900d729bc66dd0c057d3c47577442cc6 /Doc/whatsnew
parent    952ea67289ffbd2f4785a9e537884a63d1208101 (diff)
closes bpo-37966: Fully implement the UAX #15 quick-check algorithm. (GH-15558)
The purpose of the `unicodedata.is_normalized` function is to answer
the question `str == unicodedata.normalize(form, str)` more
efficiently than writing just that, by using the "quick check"
optimization described in the Unicode standard in UAX #15.
However, it turns out the code doesn't implement the full algorithm
from the standard, and as a result we often miss the optimization and
end up having to compute the whole normalized string after all.
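Concretely, the contract (unchanged by this patch) is that `is_normalized` must agree with the naive normalize-and-compare check; it is only allowed to be faster, never to differ. A small demonstration using only the public `unicodedata` API:

```python
import unicodedata

# "e" + combining acute accent: already in NFD, but not in NFC
# (NFC composes the pair into the single character U+00E9).
s = "e\u0301"

naive_nfc = (s == unicodedata.normalize("NFC", s))
assert unicodedata.is_normalized("NFC", s) == naive_nfc
assert naive_nfc is False

assert unicodedata.is_normalized("NFD", s)
```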
Implement the standard's algorithm. This greatly speeds up
`unicodedata.is_normalized` in many cases where our partial variant
of quick-check had been returning MAYBE and the standard algorithm
returns NO.
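For reference, the standard's quick-check loop (UAX #15, section 9) looks roughly like the sketch below. The per-character `qc_property` lookup is hypothetical here: it stands in for the NFC_QC/NFD_QC/NFKC_QC/NFKD_QC properties from the Unicode Character Database, which CPython keeps in its generated C tables rather than exposing in Python.

```python
import unicodedata

def quick_check(s, qc_property):
    """Sketch of the UAX #15 quick-check loop.

    qc_property(ch) is a hypothetical per-form lookup returning
    "YES", "MAYBE", or "NO" for a single character.
    """
    last_ccc = 0
    result = "YES"
    for ch in s:
        ccc = unicodedata.combining(ch)  # canonical combining class
        # Combining marks must appear in non-decreasing ccc order; a drop
        # to a smaller nonzero class means the string is not normalized.
        if last_ccc > ccc and ccc != 0:
            return "NO"
        qc = qc_property(ch)
        if qc == "NO":
            return "NO"
        if qc == "MAYBE":
            result = "MAYBE"  # keep scanning: a later char may force NO
        last_ccc = ccc
    return result
```

The point of the full loop is that a definite NO can often be reached without normalizing at all, whereas a partial implementation gives up with MAYBE and falls back to the slow normalize-and-compare.

```python
assert quick_check("abc", lambda ch: "YES") == "YES"
# ccc 230 followed by ccc 202 is out of order, so NO even if every
# character's QC property is YES:
assert quick_check("a\u0301\u0327", lambda ch: "YES") == "NO"
assert quick_check("x", lambda ch: "MAYBE") == "MAYBE"
```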
In a quick test on my desktop, the existing code takes about 4.4 ms/MB
(so 4.4 ns per byte) when the partial quick-check returns MAYBE and it
has to do the slow normalize-and-compare:
$ build.base/python -m timeit -s 'import unicodedata; s = "\uf900"*500000' \
-- 'unicodedata.is_normalized("NFD", s)'
50 loops, best of 5: 4.39 msec per loop
With this patch, it gets the answer instantly (58 ns) on the same 1 MB
string:
$ build.dev/python -m timeit -s 'import unicodedata; s = "\uf900"*500000' \
-- 'unicodedata.is_normalized("NFD", s)'
5000000 loops, best of 5: 58.2 nsec per loop
This restores a small optimization that the original version of this
code had for the `unicodedata.normalize` use case.
With this, that case is actually faster than in master!
$ build.base/python -m timeit -s 'import unicodedata; s = "\u0338"*500000' \
-- 'unicodedata.normalize("NFD", s)'
500 loops, best of 5: 561 usec per loop
$ build.dev/python -m timeit -s 'import unicodedata; s = "\u0338"*500000' \
-- 'unicodedata.normalize("NFD", s)'
500 loops, best of 5: 512 usec per loop
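The restored fast path amounts to consulting quick-check before doing any real work. A minimal sketch, with a stub standing in for the C-level quick-check (the real one runs over the generated UCD tables):

```python
import unicodedata

def _always_maybe(form, s):
    # Hypothetical stand-in for the C-level quick-check; answering
    # MAYBE always forces the slow path.
    return "MAYBE"

def normalize(form, s, quick_check=_always_maybe):
    # The restored optimization: a YES from quick-check means the string
    # is already its own normalization, so return it unchanged.
    if quick_check(form, s) == "YES":
        return s
    return unicodedata.normalize(form, s)  # full decompose/recompose pass
```

With a YES answer the input is returned as-is, skipping the per-character pass entirely; any other answer falls through to the full normalization.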
(cherry picked from commit 2f09413947d1ce0043de62ed2346f9a2b4e5880b)
Co-authored-by: Greg Price <gnprice@gmail.com>
Diffstat (limited to 'Doc/whatsnew')
-rw-r--r--  Doc/whatsnew/3.8.rst  5
1 file changed, 3 insertions, 2 deletions
diff --git a/Doc/whatsnew/3.8.rst b/Doc/whatsnew/3.8.rst
index bcdb60d..4a1362d 100644
--- a/Doc/whatsnew/3.8.rst
+++ b/Doc/whatsnew/3.8.rst
@@ -1090,8 +1090,9 @@ unicodedata
   <http://blog.unicode.org/2019/05/unicode-12-1-en.html>`_ release.
 * New function :func:`~unicodedata.is_normalized` can be used to verify a string
-  is in a specific normal form. (Contributed by Max Belanger and David Euresti in
-  :issue:`32285`).
+  is in a specific normal form, often much faster than by actually normalizing
+  the string. (Contributed by Max Belanger, David Euresti, and Greg Price in
+  :issue:`32285` and :issue:`37966`).
 unittest