Let’s take a look at a simple example of sorting some numbers into ascending order based on a user-provided comparison function. A consistent comparison function returns -1 (or any other negative value), 0, or 1 (or any other positive value) when the two provided values are either smaller, equal, or greater respectively. A comparison function that does not follow this pattern is inconsistent and can have arbitrary side-effects, such as modifying the array it’s intended to sort.
Even in the next example, calls to user code may happen. The “default” comparison function calls toString on both values and does a lexicographical comparison on the string representations.
More fun with accessors and prototype-chain interactions
This is the part where we leave the spec behind and venture into “implementation-defined” behavior land. The spec has a whole list of conditions that, when met, allow the engine to sort the object/array as it sees fit — or not at all. Engines still have to follow some ground rules but everything else is pretty much up in the air. On the one hand, this gives engine developers the freedom to experiment with different implementations. On the other hand, users expect some reasonable behavior even though the spec doesn’t require there to be any. This is further complicated by the fact that “reasonable behavior” is not always straightforward to determine.
This section shows that there are still some aspects of Array#sort where engine behavior differs greatly. These are hard edge cases, and as mentioned above it’s not always clear what “the right thing to do” actually is. We highly recommend not writing code like this; engines won’t optimize for it.
Here’s the output of that snippet in various engines. Note that there are no “right” or “wrong” answers here — the spec leaves this up to the implementation!
The next example shows interactions with the prototype chain. For the sake of brevity we don’t show the call log.
The output shows the
object after it’s sorted. Again, there is no right answer here. This example just shows how weird the interaction between indexed properties and the prototype chain can get:
What V8 does before actually sorting
V8 has two pre-processing steps before it actually sorts anything. First, if the object to sort has holes and elements on the prototype chain, they are copied from the prototype chain to the object itself. This frees us from caring about the prototype chain during all remaining steps. This is currently only done for non-JSArrays but other engines do it for JSArrays as well.
The second pre-processing step is the removal of holes. All elements in the sort-range are moved to the beginning of the object. undefineds are moved after that. This is even required by the spec to some degree as it requires us to always sort undefineds to the end. The result is that a user-provided comparison function will never get called with an undefined argument. After the second pre-processing step the sorting algorithm only needs to consider non-undefineds, potentially reducing the number of elements it actually has to sort.
Choosing a suitable pivot element has a big impact when it comes to Quicksort. V8 employed two strategies:
- The pivot was chosen as the median of the first, last, and a third element of the sub-array that gets sorted. For smaller arrays that third element is simply the middle element.
- For larger arrays a sample was taken, then sorted and the median of the sorted sample served as the third element in the above calculation.
One of the advantages of Quicksort is that it sorts in-place. The memory overhead comes from allocating a small array for the sample when sorting large arrays, and log(n) stack space. The downside is that it’s not a stable algorithm and there’s a chance the algorithm hits the worst-case scenario where QuickSort degrades to O(n^2).
快速排序的优势是其效率不错。内存消耗来源于对大数组进行排序时候取样需要的一个小数组，大小为 log(n) 的栈空间。劣势是其排序表型不够稳定，当遇到最差的应用场景的时候，其效率会降低到 O(n^2)。
Introducing V8 Torque
As an avid reader of the V8 blog you might have heard of CodeStubAssembler or CSA for short. CSA is a V8 component that allows us to write low-level TurboFan IR directly in C++ that later gets translated to machine code for the appropriate architecture using TurboFan’s backend.
The downside of CSA is that it really can be considered an assembly language. Control-flow is modeled using explicit labels and gotos, which makes implementing more complex algorithms in CSA hard to read and error-prone.
Enter V8 Torque. Torque is a domain-specific language with TypeScript-like syntax that currently uses CSA as its sole compilation target. Torque allows nearly the same level of control as CSA does while at the same time offering higher-level constructs such as while and for loops. Additionally, it’s strongly typed and will in the future contain security checks such as automatic out-of-bound checks providing V8 engineers with stronger guarantees.
The first major builtins that were re-written in V8 Torque were TypedArray#sort and Dataview operations. Both served the additional purpose of providing feedback to the Torque developers on what languages features are needed and idioms should be used to write builtins efficiently. At the time of writing, several
Array#unshift) while others were completely re-written (e.g.
Moving Array#sort to Torque
This worked reasonably well, but as it still utilized Quicksort, Array#sort remained unstable. The request for a stable Array#sort is among the oldest tickets in V8’s bug tracker. Experimenting with Timsort as a next step offered us multiple things. First, we like that it’s stable and offers some nice algorithmic guarantees (see next section). Second, Torque was still a work-in-progress and implementing a more complex builtin such as Array#sort with Timsort resulted in lots of actionable feedback influencing Torque as a language.
Timsort, initially developed by Tim Peters for Python in 2002, could best be described as an adaptive stable Mergesort variant. Even though the details are rather complex and are best described by the man himself or the Wikipedia page, the basics are easy to understand. While Mergesort usually works in recursive fashion, Timsort works iteratively. It processes an array from left to right and looks for so-called runs. A run is simply a sequence that is already sorted. This includes sequences that are sorted “the wrong way” as these sequences can simply be reversed to form a run. At the start of the sorting process a minimum run length is determined that depends on the length of the input. If Timsort can’t find natural runs of this minimum run length a run is “boosted artificially” using Insertion Sort.
Runs that are found this way are tracked using a stack that remembers a starting index and a length of each run. From time to time runs on the stack are merged together until only one sorted run remains. Timsort tries to maintain a balance when it comes to deciding which runs to merge. On the one hand you want to try and merge early as the data of those runs has a high chance of already being in the cache, on the other hand you want to merge as late as possible to take advantage of patterns in the data that might emerge. To accomplish this, Timsort maintains two invariants. Assuming
C are the three top-most runs:
|C| > |B| + |A|
|B| > |A|
The image shows the case where
|A| > |B| so
B is merged with the smaller of the two runs.
Note that Timsort only merges consecutive runs, this is needed to maintain stability, otherwise equal elements would be transferred between runs. Also the first invariant makes sure that run lengths grow at least as fast as the Fibonacci numbers, giving an upper bound on the size of the run stack when we know the maximum array length.
One can now see that already-sorted sequences are sorted in O(n) as such an array would result in a single run that does not need to get merged. The worst case is O(n log n). These algorithmic properties together with the stable nature of Timsort were a few of the reasons why we chose Timsort over Quicksort in the end.
Implementing Timsort in Torque
Builtins usually have different code-paths that are chosen during runtime depending on various variables. The most generic version can handle any kind of object, regardless if its a
JSProxy, has interceptors or needs to do prototype chain lookups when retrieving or setting properties.
The generic path is rather slow in most cases, as it needs to account for all eventualities. But if we know upfront that the object to sort is a simple JSArray containing only Smis, all these expensive
[[Set]] operations can be replaced by simple Loads and Stores to a
FixedArray. The main differentiator is the ElementsKind.
The problem now becomes how to implement a fast-path. The core algorithm stays the same for all but the way we access elements changes based on the
ElementsKind. One way we could accomplish this is to dispatch to the correct “accessor” on each call-site. Imagine a switch for each “load”/”store” operation where we choose a different branch based on the chosen fast-path.
Another solution (and this was the first approach tried) is to just copy the whole builtin once for each fast-path and inline the correct load/store access method. This approach turned out to be infeasible for Timsort as it’s a big builtin and making a copy for each fast-path turned out to require 106 KB in total, which is way too much for a single builtin.
The final solution is slightly different. Each load/store operation for each fast-path is put into its own “mini-builtin”. See the code example which shows the “load” operation for
To compare, the most generic “load” operation is simply a call to
GetProperty. But while the above version generates efficient and fast machine code to load and convert a
GetProperty is a call to another builtin that could potentially involve a prototype chain lookup or invoke an accessor function.
A fast-path then simply becomes a set of function pointers. This means we only need one copy of the core algorithm while setting up all relevant function pointers once upfront. While this greatly reduces the needed code space (down to 20k) it comes at the cost of an indirect branch at each access site. This is even exacerbated by the recent change to use embedded builtins.
The picture above shows the “sort state”. It’s a
FixedArray that keeps track of all the things needed while sorting. Each time
Array#sort is called, such a sort state is allocated. Entry 4 to 7 are the set of function pointers discussed above that comprise a fast-path.
The “compare” entry can point to two different builtins. One calls a user-provided comparison function while the other implements the default comparison that calls
toString on both arguments and then does a lexicographical comparison.
The rest of the fields (with the exception of the fast path ID) are Timsort-specific. The run stack (described above) is initialized with a size of 85 which is enough to sort arrays of length 264. The temporary array is used for merging runs. It grows in size as needed but never exceeds
n is the input length.
While the Torque approach doesn’t result in the same peak performance for sorting, it does avoid performance cliffs. The result is a sorting performance that is much more predictable than it previously was. Keep in mind that Torque is very much in flux and in addition of targeting CSA it might target TurboFan in the future, allowing JIT compilation of code written in Torque.
Before we started with
Array#sort, we added a lot of different micro-benchmarks to get a better understanding of the impact the re-implementation would have. The first chart shows the “normal” use case of sorting various ElementsKinds with a user-provided comparison function.
The next chart shows the impact of Timsort when processing arrays that are already sorted completely, or have sub-sequences that are already sorted one-way or another. The chart uses Quicksort as a baseline and shows the speedup of Timsort (up to 17× in the case of “DownDown” where the array consists of two reverse-sorted sequences). As can be seen, expect in the case of random data, Timsort performs better in all other cases, even though we are sorting
PACKED_SMI_ELEMENTS, where Quicksort outperformed Timsort in the microbenchmark above.
Web Tooling Benchmark
The chai benchmark spends a third of its time inside a single comparison function (a string distance calculation). The benchmark is the test suite of chai itself. Due to the data, Timsort needs some more comparisons in this case, which has a bigger impact on the overall runtime, as such a big portion of time is spent inside that particular comparison function.
Analyzing V8 heap snapshots while browsing some 50 sites (both on mobile as well as on desktop) didn’t show any memory regressions or improvements. On the one hand, this is surprising: the switch from Quicksort to Timsort introduced the need for a temporary array for merging runs, which can grow much larger than the temporary arrays used for sampling. On the other hand, these temporary arrays are very short-lived (only for the duration of the
sort call) and can be allocated and discarded rather quickly in V8’s new space.
In summary we feel much better about the algorithmic properties and the predictable performance behavior of a Timsort implemented in Torque. Timsort is available starting with V8 v7.0 and Chrome 70. Happy sorting!