From ec8147ba5548bac5cefa41f98517e52528a6a0bd Mon Sep 17 00:00:00 2001 From: Tim Peters Date: Sat, 24 Aug 2013 15:15:19 -0500 Subject: Various clarifications based on feedback & questions over the years. (grafted from 23181bf411a16287a0a54e910fc0f9ecd2764bf0) --- Objects/listsort.txt | 115 ++++++++++++++++++++++++++++++++++++++++++--------- 1 file changed, 96 insertions(+), 19 deletions(-) diff --git a/Objects/listsort.txt b/Objects/listsort.txt index 3ec4393..832e4f2 100644 --- a/Objects/listsort.txt +++ b/Objects/listsort.txt @@ -100,11 +100,13 @@ Comparison with Python's Samplesort Hybrid The algorithms are effectively identical in these cases, except that timsort does one less compare in \sort. - Now for the more interesting cases. lg(n!) is the information-theoretic - limit for the best any comparison-based sorting algorithm can do on - average (across all permutations). When a method gets significantly - below that, it's either astronomically lucky, or is finding exploitable - structure in the data. + Now for the more interesting cases. Where lg(x) is the logarithm of x to + the base 2 (e.g., lg(8)=3), lg(n!) is the information-theoretic limit for + the best any comparison-based sorting algorithm can do on average (across + all permutations). When a method gets significantly below that, it's + either astronomically lucky, or is finding exploitable structure in the + data. + n lg(n!) *sort 3sort +sort %sort ~sort !sort ------- ------- ------ ------- ------- ------ ------- -------- @@ -251,7 +253,7 @@ Computing minrun ---------------- If N < 64, minrun is N. IOW, binary insertion sort is used for the whole array then; it's hard to beat that given the overheads of trying something -fancier. +fancier (see note BINSORT). When N is a power of 2, testing on random data showed that minrun values of 16, 32, 64 and 128 worked about equally well. At 256 the data-movement cost @@ -379,10 +381,10 @@ with wildly unbalanced run lengths. Merge Memory ------------ -Merging adjacent runs of lengths A and B in-place is very difficult. -Theoretical constructions are known that can do it, but they're too difficult -and slow for practical use. But if we have temp memory equal to min(A, B), -it's easy. +Merging adjacent runs of lengths A and B in-place, and in linear time, is +difficult. Theoretical constructions are known that can do it, but they're +too difficult and slow for practical use. But if we have temp memory equal +to min(A, B), it's easy. If A is smaller (function merge_lo), copy A to a temp array, leave B alone, and then we can do the obvious merge algorithm left to right, from the temp @@ -457,10 +459,10 @@ finding the right spot early in B (more on that later). After finding such a k, the region of uncertainty is reduced to 2**(k-1) - 1 consecutive elements, and a straight binary search requires exactly k-1 -additional comparisons to nail it. Then we copy all the B's up to that -point in one chunk, and then copy A[0]. Note that no matter where A[0] -belongs in B, the combination of galloping + binary search finds it in no -more than about 2*lg(B) comparisons. +additional comparisons to nail it (see note REGION OF UNCERTAINTY). Then we +copy all the B's up to that point in one chunk, and then copy A[0]. Note +that no matter where A[0] belongs in B, the combination of galloping + binary +search finds it in no more than about 2*lg(B) comparisons. 

 If we did a straight binary search, we could find it in no more than
 ceiling(lg(B+1)) comparisons -- but straight binary search takes that many
@@ -573,11 +575,11 @@ Galloping Complication
 
 The description above was for merge_lo. merge_hi has to merge "from the
 other end", and really needs to gallop starting at the last element in a run
 instead of the first. Galloping from the first still works, but does more
-comparisons than it should (this is significant -- I timed it both ways).
-For this reason, the gallop_left() and gallop_right() functions have a
-"hint" argument, which is the index at which galloping should begin. So
-galloping can actually start at any index, and proceed at offsets of 1, 3,
-7, 15, ... or -1, -3, -7, -15, ... from the starting index.
+comparisons than it should (this is significant -- I timed it both ways). For
+this reason, the gallop_left() and gallop_right() (see note LEFT OR RIGHT)
+functions have a "hint" argument, which is the index at which galloping
+should begin. So galloping can actually start at any index, and proceed at
+offsets of 1, 3, 7, 15, ... or -1, -3, -7, -15, ... from the starting index.
 
 In the code as I type it's always called with either 0 or n-1 (where n is
 the # of elements in a run). It's tempting to try to do something fancier,
@@ -676,3 +678,78 @@
 immediately. The consequence is that it ends up using two compares to sort
 [2, 1]. Gratifyingly, timsort doesn't do any special-casing, so had to be
 taught how to deal with mixtures of ascending and descending runs
 efficiently in all cases.
+
+
+NOTES
+-----
+
+BINSORT
+A "binary insertion sort" is just like a textbook insertion sort, but instead
+of locating the correct position of the next item via linear (one at a time)
+search, an equivalent to Python's bisect.bisect_right is used to find the
+correct position in logarithmic time. Most texts don't mention this
+variation, and those that do usually say it's not worth the bother: insertion
+sort remains quadratic (expected and worst cases) either way. Speeding the
+search doesn't reduce the quadratic data movement costs.
+
+But in CPython's case, comparisons are extraordinarily expensive compared to
+moving data, and the details matter. Moving objects is just copying
+pointers. Comparisons can be arbitrarily expensive (can invoke arbitrary
+user-supplied Python code), but even in simple cases (like 3 < 4) _all_
+decisions are made at runtime: what's the type of the left comparand? the
+type of the right? do they need to be coerced to a common type? where's the
+code to compare these types? And so on. Even the simplest Python comparison
+triggers a large pile of C-level pointer dereferences, conditionals, and
+function calls.
+
+So cutting the number of compares is almost always measurably helpful in
+CPython, and the savings swamp the quadratic-time data movement costs for
+reasonable minrun values.
+
+
+LEFT OR RIGHT
+gallop_left() and gallop_right() are akin to the Python bisect module's
+bisect_left() and bisect_right(): they're the same unless the slice they're
+searching contains at least one value equal to the value being searched
+for. In that case, gallop_left() returns the position immediately before the
+leftmost equal value, and gallop_right() the position immediately after the
+rightmost equal value. The distinction is needed to preserve stability. In
+general, when merging adjacent runs A and B, gallop_left is used to search
+thru B for where an element from A belongs, and gallop_right to search thru A
+for where an element from B belongs.
+
+
+REGION OF UNCERTAINTY
+Two kinds of confusion seem to be common about the claim that after finding
+a k such that
+
+    B[2**(k-1) - 1] < A[0] <= B[2**k - 1]
+
+then a binary search requires exactly k-1 tries to find A[0]'s proper
+location. For concreteness, say k=3, so B[3] < A[0] <= B[7].
+
+The first confusion takes the form "OK, then the region of uncertainty is at
+indices 3, 4, 5, 6 and 7: that's 5 elements, not the claimed 2**(k-1) - 1 =
+3"; or the region is viewed as a Python slice and the objection is "but that's
+the slice B[3:7], so has 7-3 = 4 elements". Resolution: we've already
+compared A[0] against B[3] and against B[7], so A[0]'s correct location is
+already known wrt _both_ endpoints. What remains is to find A[0]'s correct
+location wrt B[4], B[5] and B[6], which spans 3 elements. Or in general, the
+slice (leaving off both endpoints) (2**(k-1)-1)+1 through (2**k-1)-1
+inclusive = 2**(k-1) through (2**k-1)-1 inclusive, which has
+    (2**k-1)-1 - 2**(k-1) + 1 =
+    2**k-1 - 2**(k-1) =
+    2*2**(k-1)-1 - 2**(k-1) =
+    (2-1)*2**(k-1) - 1 =
+    2**(k-1) - 1
+elements.
+
+The second confusion: "k-1 = 2 binary searches can find the correct location
+among 2**(k-1) = 4 elements, but you're only applying it to 3 elements: we
+could make this more efficient by arranging for the region of uncertainty to
+span 2**(k-1) elements." Resolution: that confuses "elements" with
+"locations". In a slice with N elements, there are N+1 _locations_. In the
+example, with the region of uncertainty B[4], B[5], B[6], there are 4
+locations: before B[4], between B[4] and B[5], between B[5] and B[6], and
+after B[6]. In general, across 2**(k-1)-1 elements, there are 2**(k-1)
+locations. That's why k-1 binary search comparisons are necessary and sufficient.
--
cgit v0.12
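
The sections touched by this patch are all prose; the short Python sketches
below are meant only to make them concrete. None of this is the actual
implementation -- that's C code in Objects/listobject.c -- and except where
noted the names are illustrative rather than the real entry points.

First, the lg(n!) information-theoretic limit quoted in the table near the
top. math.lgamma(n + 1) is ln(n!), so one line suffices:

    import math

    def lg_factorial(n):
        # lg(n!) = ln(n!) / ln(2); lgamma(n + 1) == ln(n!) for integer n >= 0
        return math.lgamma(n + 1) / math.log(2)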
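
Computing minrun: the full rule in that section of listsort.txt (use N
itself below 64; otherwise pick a value in range(32, 65) so that N/minrun
is a power of 2, or close to but strictly less than one) boils down to
taking the six most significant bits of N and adding 1 if any of the
remaining bits are set, which is what listobject.c's merge_compute_minrun()
does. In Python:

    def merge_compute_minrun(n):
        # take the 6 most significant bits of n, and add 1 if any of the
        # remaining bits are set; for n < 64 this just returns n
        r = 0
        while n >= 64:
            r |= n & 1
            n >>= 1
        return n + r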
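
Note BINSORT in a few lines of Python. This is only a model (the real
binary insertion sort is C code in listobject.c and works in place on a
prefix of a run), but it shows where the savings come from: bisect_right
spends O(lg i) compares placing each new item, while the data movement --
cheap pointer copies in CPython -- remains quadratic:

    from bisect import bisect_right

    def binary_insertion_sort(a):
        for i in range(1, len(a)):
            x = a[i]
            # O(lg i) compares; bisect_right keeps equal elements in their
            # original order, so the sort stays stable
            pos = bisect_right(a, x, 0, i)
            # O(i) moves to shift a[pos:i] right one slot and drop x in
            a[pos + 1:i + 1] = a[pos:i]
            a[pos] = x
        return a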
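
The Merge Memory section's easy case, with the shorter run A on the left:
copy A to temp memory, then do the obvious merge left to right into the
space A vacated. Galloping mode and all the bookkeeping are omitted; this
is just a sketch of the idea behind merge_lo:

    def merge_lo(a, lo, mid, hi):
        # merge adjacent sorted runs A = a[lo:mid] and B = a[mid:hi],
        # where len(A) <= len(B); only A is copied to temp memory
        tmp = a[lo:mid]
        i, j, k = 0, mid, lo
        while i < len(tmp) and j < hi:
            if a[j] < tmp[i]:        # strictly less: on ties A's element
                a[k] = a[j]          # goes first, preserving stability
                j += 1
            else:
                a[k] = tmp[i]
                i += 1
            k += 1
        # if A ran out first the rest of B is already in place; otherwise
        # copy A's leftovers to the tail
        a[k:k + len(tmp) - i] = tmp[i:]

Because the writes trail the reads (k never catches up to j while tmp still
has elements), nothing in B is overwritten before it has been compared.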
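
Galloping itself, in the simple hint=0 case described above: probe at
offsets 1, 3, 7, 15, ... until the value is bracketed, then finish with a
straight binary search over the region of uncertainty. This sketch has
gallop_left() semantics (like bisect_left, it lands before any equal
elements); the real functions also take a "hint" starting index and can
gallop from the right end of a run:

    from bisect import bisect_left

    def gallop_left(x, b):
        # where does x belong in the sorted list b?
        n = len(b)
        if n == 0 or b[0] >= x:
            return 0
        last_ofs, ofs = 0, 1         # invariant: b[last_ofs] < x
        while ofs < n and b[ofs] < x:
            last_ofs = ofs
            ofs = (ofs << 1) + 1     # probe offsets 1, 3, 7, 15, ...
        if ofs > n:
            ofs = n
        # region of uncertainty is b[last_ofs+1 : ofs]; a straight binary
        # search over it costs only about lg of its length more compares
        return bisect_left(b, x, last_ofs + 1, ofs)

Changing the tests so that elements equal to x keep the gallop going (and
using bisect_right for the final search) gives gallop_right() semantics
instead -- landing after the rightmost equal element -- which is the
distinction note LEFT OR RIGHT is about.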
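
And the arithmetic in note REGION OF UNCERTAINTY, checked mechanically
instead of by algebra: for every k, the slice strictly between the two
already-compared endpoints has 2**(k-1) - 1 elements, hence 2**(k-1)
possible locations:

    for k in range(2, 25):
        lo = 2**(k - 1) - 1                # already compared: B[lo] < A[0]
        hi = 2**k - 1                      # already compared: A[0] <= B[hi]
        elements = hi - lo - 1             # indices strictly between them
        assert elements == 2**(k - 1) - 1
        assert elements + 1 == 2**(k - 1)  # locations = elements + 1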