From ec8147ba5548bac5cefa41f98517e52528a6a0bd Mon Sep 17 00:00:00 2001 From: Tim Peters Date: Sat, 24 Aug 2013 15:15:19 -0500 Subject: Various clarifications based on feedback & questions over the years. (grafted from 23181bf411a16287a0a54e910fc0f9ecd2764bf0) --- Objects/listsort.txt | 115 ++++++++++++++++++++++++++++++++++++++++++--------- 1 file changed, 96 insertions(+), 19 deletions(-) diff --git a/Objects/listsort.txt b/Objects/listsort.txt index 3ec4393..832e4f2 100644 --- a/Objects/listsort.txt +++ b/Objects/listsort.txt @@ -100,11 +100,13 @@ Comparison with Python's Samplesort Hybrid The algorithms are effectively identical in these cases, except that timsort does one less compare in \sort. - Now for the more interesting cases. lg(n!) is the information-theoretic - limit for the best any comparison-based sorting algorithm can do on - average (across all permutations). When a method gets significantly - below that, it's either astronomically lucky, or is finding exploitable - structure in the data. + Now for the more interesting cases. Where lg(x) is the logarithm of x to + the base 2 (e.g., lg(8)=3), lg(n!) is the information-theoretic limit for + the best any comparison-based sorting algorithm can do on average (across + all permutations). When a method gets significantly below that, it's + either astronomically lucky, or is finding exploitable structure in the + data. + n lg(n!) *sort 3sort +sort %sort ~sort !sort ------- ------- ------ ------- ------- ------ ------- -------- @@ -251,7 +253,7 @@ Computing minrun ---------------- If N < 64, minrun is N. IOW, binary insertion sort is used for the whole array then; it's hard to beat that given the overheads of trying something -fancier. +fancier (see note BINSORT). When N is a power of 2, testing on random data showed that minrun values of 16, 32, 64 and 128 worked about equally well. At 256 the data-movement cost @@ -379,10 +381,10 @@ with wildly unbalanced run lengths. Merge Memory ------------ -Merging adjacent runs of lengths A and B in-place is very difficult. -Theoretical constructions are known that can do it, but they're too difficult -and slow for practical use. But if we have temp memory equal to min(A, B), -it's easy. +Merging adjacent runs of lengths A and B in-place, and in linear time, is +difficult. Theoretical constructions are known that can do it, but they're +too difficult and slow for practical use. But if we have temp memory equal +to min(A, B), it's easy. If A is smaller (function merge_lo), copy A to a temp array, leave B alone, and then we can do the obvious merge algorithm left to right, from the temp @@ -457,10 +459,10 @@ finding the right spot early in B (more on that later). After finding such a k, the region of uncertainty is reduced to 2**(k-1) - 1 consecutive elements, and a straight binary search requires exactly k-1 -additional comparisons to nail it. Then we copy all the B's up to that -point in one chunk, and then copy A[0]. Note that no matter where A[0] -belongs in B, the combination of galloping + binary search finds it in no -more than about 2*lg(B) comparisons. +additional comparisons to nail it (see note REGION OF UNCERTAINTY). Then we +copy all the B's up to that point in one chunk, and then copy A[0]. Note +that no matter where A[0] belongs in B, the combination of galloping + binary +search finds it in no more than about 2*lg(B) comparisons. 

 If we did a straight binary search, we could find it in no more than
 ceiling(lg(B+1)) comparisons -- but straight binary search takes that many
@@ -573,11 +575,11 @@ Galloping Complication
 
 The description above was for merge_lo. merge_hi has to merge "from the
 other end", and really needs to gallop starting at the last element in a run
 instead of the first. Galloping from the first still works, but does more
-comparisons than it should (this is significant -- I timed it both ways).
-For this reason, the gallop_left() and gallop_right() functions have a
-"hint" argument, which is the index at which galloping should begin. So
-galloping can actually start at any index, and proceed at offsets of 1, 3,
-7, 15, ... or -1, -3, -7, -15, ... from the starting index.
+comparisons than it should (this is significant -- I timed it both ways). For
+this reason, the gallop_left() and gallop_right() (see note LEFT OR RIGHT)
+functions have a "hint" argument, which is the index at which galloping
+should begin. So galloping can actually start at any index, and proceed at
+offsets of 1, 3, 7, 15, ... or -1, -3, -7, -15, ... from the starting index.
 
 In the code as I type it's always called with either 0 or n-1 (where n is
 the # of elements in a run). It's tempting to try to do something fancier,
@@ -676,3 +678,78 @@
 immediately. The consequence is that it ends up using two compares to sort
 [2, 1]. Gratifyingly, timsort doesn't do any special-casing, so had to be
 taught how to deal with mixtures of ascending and descending runs
 efficiently in all cases.
+
+
+NOTES
+-----
+
+BINSORT
+A "binary insertion sort" is just like a textbook insertion sort, but instead
+of locating the correct position of the next item via linear (one at a time)
+search, an equivalent to Python's bisect.bisect_right is used to find the
+correct position in logarithmic time. Most texts don't mention this
+variation, and those that do usually say it's not worth the bother: insertion
+sort remains quadratic (expected and worst cases) either way. Speeding the
+search doesn't reduce the quadratic data movement costs.
+
+But in CPython's case, comparisons are extraordinarily expensive compared to
+moving data, and the details matter. Moving objects is just copying
+pointers. Comparisons can be arbitrarily expensive (can invoke arbitrary
+user-supplied Python code), but even in simple cases (like 3 < 4) _all_
+decisions are made at runtime: what's the type of the left comparand? the
+type of the right? do they need to be coerced to a common type? where's the
+code to compare these types? And so on. Even the simplest Python comparison
+triggers a large pile of C-level pointer dereferences, conditionals, and
+function calls.
+
+So cutting the number of compares is almost always measurably helpful in
+CPython, and the savings swamp the quadratic-time data movement costs for
+reasonable minrun values.
+
+
+LEFT OR RIGHT
+gallop_left() and gallop_right() are akin to the Python bisect module's
+bisect_left() and bisect_right(): they're the same unless the slice they're
+searching contains at least one value equal to the value being searched
+for. In that case, gallop_left() returns the position immediately before the
+leftmost equal value, and gallop_right() the position immediately after the
+rightmost equal value. The distinction is needed to preserve stability. In
+general, when merging adjacent runs A and B, gallop_left is used to search
+thru B for where an element from A belongs, and gallop_right to search thru A
+for where an element from B belongs.
+
+
+REGION OF UNCERTAINTY
+Two kinds of confusion seem to be common about the claim that after finding
+a k such that
+
+    B[2**(k-1) - 1] < A[0] <= B[2**k - 1]
+
+then a binary search requires exactly k-1 tries to find A[0]'s proper
+location. For concreteness, say k=3, so B[3] < A[0] <= B[7].
+
+The first confusion takes the form "OK, then the region of uncertainty is at
+indices 3, 4, 5, 6 and 7: that's 5 elements, not the claimed 2**(k-1) - 1 =
+3"; or the region is viewed as a Python slice and the objection is "but that's
+the slice B[3:7], so has 7-3 = 4 elements". Resolution: we've already
+compared A[0] against B[3] and against B[7], so A[0]'s correct location is
+already known wrt _both_ endpoints. What remains is to find A[0]'s correct
+location wrt B[4], B[5] and B[6], which spans 3 elements. Or in general, the
+slice (leaving off both endpoints) (2**(k-1)-1)+1 through (2**k-1)-1
+inclusive = 2**(k-1) through (2**k-1)-1 inclusive, which has
+    (2**k-1)-1 - 2**(k-1) + 1 =
+    2**k-1 - 2**(k-1) =
+    2*2**(k-1)-1 - 2**(k-1) =
+    (2-1)*2**(k-1) - 1 =
+    2**(k-1) - 1
+elements.
+
+The second confusion: "k-1 = 2 binary searches can find the correct location
+among 2**(k-1) = 4 elements, but you're only applying it to 3 elements: we
+could make this more efficient by arranging for the region of uncertainty to
+span 2**(k-1) elements." Resolution: that confuses "elements" with
+"locations". In a slice with N elements, there are N+1 _locations_. In the
+example, with the region of uncertainty B[4], B[5], B[6], there are 4
+locations: before B[4], between B[4] and B[5], between B[5] and B[6], and
+after B[6]. In general, across 2**(k-1)-1 elements, there are 2**(k-1)
+locations. That's why k-1 binary search comparisons are necessary and sufficient.
--
cgit v0.12
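
The sections touched by this patch are all prose; the short Python sketches
below are meant only to make them concrete. None of this is the actual
implementation -- that's C code in Objects/listobject.c -- and except where
noted the names are illustrative rather than the real entry points.

First, the lg(n!) information-theoretic limit quoted in the table near the
top. math.lgamma(n + 1) is ln(n!), so one line suffices:

    import math

    def lg_factorial(n):
        # lg(n!) = ln(n!) / ln(2); lgamma(n + 1) == ln(n!) for integer n >= 0
        return math.lgamma(n + 1) / math.log(2)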
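
Computing minrun: the full rule in that section of listsort.txt (use N
itself below 64; otherwise pick a value in range(32, 65) so that N/minrun
is a power of 2, or close to but strictly less than one) boils down to
taking the six most significant bits of N and adding 1 if any of the
remaining bits are set, which is what listobject.c's merge_compute_minrun()
does. In Python:

    def merge_compute_minrun(n):
        # take the 6 most significant bits of n, and add 1 if any of the
        # remaining bits are set; for n < 64 this just returns n
        r = 0
        while n >= 64:
            r |= n & 1
            n >>= 1
        return n + r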
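
Note BINSORT in a few lines of Python. This is only a model (the real
binary insertion sort is C code in listobject.c and works in place on a
prefix of a run), but it shows where the savings come from: bisect_right
spends O(lg i) compares placing each new item, while the data movement --
cheap pointer copies in CPython -- remains quadratic:

    from bisect import bisect_right

    def binary_insertion_sort(a):
        for i in range(1, len(a)):
            x = a[i]
            # O(lg i) compares; bisect_right keeps equal elements in their
            # original order, so the sort stays stable
            pos = bisect_right(a, x, 0, i)
            # O(i) moves to shift a[pos:i] right one slot and drop x in
            a[pos + 1:i + 1] = a[pos:i]
            a[pos] = x
        return a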
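
The Merge Memory section's easy case, with the shorter run A on the left:
copy A to temp memory, then do the obvious merge left to right into the
space A vacated. Galloping mode and all the bookkeeping are omitted; this
is just a sketch of the idea behind merge_lo:

    def merge_lo(a, lo, mid, hi):
        # merge adjacent sorted runs A = a[lo:mid] and B = a[mid:hi],
        # where len(A) <= len(B); only A is copied to temp memory
        tmp = a[lo:mid]
        i, j, k = 0, mid, lo
        while i < len(tmp) and j < hi:
            if a[j] < tmp[i]:        # strictly less: on ties A's element
                a[k] = a[j]          # goes first, preserving stability
                j += 1
            else:
                a[k] = tmp[i]
                i += 1
            k += 1
        # if A ran out first the rest of B is already in place; otherwise
        # copy A's leftovers to the tail
        a[k:k + len(tmp) - i] = tmp[i:]

Because the writes trail the reads (k never catches up to j while tmp still
has elements), nothing in B is overwritten before it has been compared.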
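
Galloping itself, in the simple hint=0 case described above: probe at
offsets 1, 3, 7, 15, ... until the value is bracketed, then finish with a
straight binary search over the region of uncertainty. This sketch has
gallop_left() semantics (like bisect_left, it lands before any equal
elements); the real functions also take a "hint" starting index and can
gallop from the right end of a run:

    from bisect import bisect_left

    def gallop_left(x, b):
        # where does x belong in the sorted list b?
        n = len(b)
        if n == 0 or b[0] >= x:
            return 0
        last_ofs, ofs = 0, 1         # invariant: b[last_ofs] < x
        while ofs < n and b[ofs] < x:
            last_ofs = ofs
            ofs = (ofs << 1) + 1     # probe offsets 1, 3, 7, 15, ...
        if ofs > n:
            ofs = n
        # region of uncertainty is b[last_ofs+1 : ofs]; a straight binary
        # search over it costs only about lg of its length more compares
        return bisect_left(b, x, last_ofs + 1, ofs)

Changing the tests so that elements equal to x keep the gallop going (and
using bisect_right for the final search) gives gallop_right() semantics
instead -- landing after the rightmost equal element -- which is the
distinction note LEFT OR RIGHT is about.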
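
And the arithmetic in note REGION OF UNCERTAINTY, checked mechanically
instead of by algebra: for every k, the slice strictly between the two
already-compared endpoints has 2**(k-1) - 1 elements, hence 2**(k-1)
possible locations:

    for k in range(2, 25):
        lo = 2**(k - 1) - 1                # already compared: B[lo] < A[0]
        hi = 2**k - 1                      # already compared: A[0] <= B[hi]
        elements = hi - lo - 1             # indices strictly between them
        assert elements == 2**(k - 1) - 1
        assert elements + 1 == 2**(k - 1)  # locations = elements + 1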