1. 原文

Embedded builtins

2. 摘要翻译

V8 的内建函数(builtins)在每个 V8 实例内都会消耗内存。builtin 数量,平均大小,以及每个 Chrome 浏览器 tab 中的 V8 实例数量都显著增长了。这篇博客将会详细描述我们在之前的一年中是如何将 V8 堆内存大小中位数降低 19% 的。

Background

V8 ships with an extensive library of JavaScript (JS) built-in functions. Many builtins are directly exposed to JS developers as functions installed on JS built-in objects, such as RegExp.prototype.exec and Array.prototype.sort; other builtins implement various internal functionality. Machine code for builtins is generated by V8’s own compiler, and is loaded onto the managed heap state for every V8 Isolate upon initialization. An Isolate represents an isolated instance of the V8 engine, and every browser tab in Chrome contains at least one Isolate. Every Isolate has its own managed heap, and thus its own copy of all builtins.

Back in 2015, builtins were mostly implemented in self-hosted JS, native assembly, or in C++. They were fairly small, and creating a copy for every Isolate was less problematic.

Much has changed in this space over the last years.

In 2016, V8 began experimenting with builtins implemented in CodeStubAssembler (CSA). This turned out to both be convenient (platform-independent, readable) and to produce efficient code, so CSA builtins became ubiquitous. For a variety of reasons, CSA builtins tend to produce larger code, and the size of V8 builtins roughly tripled as more and more were ported to CSA. By mid-2017, their per-Isolate overhead had grown significantly and we started thinking about a systematic solution.

V8 snapshot size (including builtins) from 2015 until 2017

In late 2017, we implemented lazy builtin (and bytecode handler) deserialization as a first step. Our initial analysis showed that most sites used less than half of all builtins. With lazy deserialization, builtins are loaded on-demand, and unused builtins are never loaded into the Isolate. Lazy deserialization was shipped in Chrome 64 with promising memory savings. But: builtin memory overhead was still linear in the number of Isolates.

Then, Spectre was disclosed, and Chrome ultimately turned on site isolation to mitigate its effects. Site isolation limits a Chrome renderer process to documents from a single origin. Thus, with site isolation, many browsing tabs create more renderer processes and more V8 Isolates. Even though managing per-Isolate overhead has always been important, site isolation has made it even more so.

概略翻译: V8 的 builtin 函数会占用内存,并且会在 isolate 初始化的时候被装载进 heap 堆内存里。从 2016 年开始,V8 使用 CodeStubAssembler (CSA) 来 对 builtin 进行编码。这种做法有利有弊,比较大的问题就是代码量及体积比较大。2017 年开始,V8 开始使用惰式加载 builtin,但这种做法并没有解决根本的问题。

Embedded builtins

Our goal for this project was to completely eliminate per-Isolate builtin overhead.

The idea behind it was simple. Conceptually, builtins are identical across Isolates, and are only bound to an Isolate because of implementation details. If we could make builtins truly isolate-independent, we could keep a single copy in memory and share them across all Isolates. And if we could make them process-independent, they could even be shared across processes.

In practice, we faced several challenges. Generated builtin code was neither isolate- nor process-independent due to embedded pointers to isolate- and process-specific data. V8 had no concept of executing generated code located outside the managed heap. Builtins had to be shared across processes, ideally by reusing existing OS mechanisms. And finally (this turned out to be the long tail), performance must not noticeably regress.

The following sections describe our solution in detail.

概略翻译: 当前优化项目的目标是完全地解决每个 isolate 内的 builtin 内存消耗。解决方案思路很简单,就是将 builtin 独立出来,不要和 isolate 耦合在一起,那么所有的 isolate 就可以使用一份内存了。但在实践中也遇到了诸多挑战。V8 并没有在堆内存之外执行声称出来代码的机制。等等。

Isolate- and process-independent code

Builtins are generated by V8’s compiler internal pipeline, which embeds references to heap constants (located on the Isolate’s managed heap), call targets (Code objects, also on the managed heap), and to isolate- and process-specific addresses (e.g.: C runtime functions or a pointer to the Isolate itself, also called ’external references’) directly into the code. In x64 assembly, a load of such an object could look as follows:

// Load an embedded address into register rbx. REX.W movq rbx,0x56526afd0f70

V8 has a moving garbage collector, and the location of the target object could change over time. Should the target be moved during collection, the GC updates the generated code to point at the new location.

On x64 (and most other architectures), calls to other Code objects use an efficient call instruction which specifies the call target by an offset from the current program counter (an interesting detail: V8 reserves its entire CODE_SPACE on the managed heap at startup to ensure all possible Code objects remain within an addressable offset of each other). The relevant part of the calling sequence looks like this:

// Call instruction located at [pc + <offset>].
call <offset>


A pc-relative call

Code objects themselves live on the managed heap and are movable. When they are moved, the GC updates the offset at all relevant call sites.

In order to share builtins across processes, generated code must be immutable as well as isolate- and process-independent. Both instruction sequences above do not fulfill that requirement: they directly embed addresses in the code, and are patched at runtime by the GC.

To address both issues, we introduced an indirection through a dedicated, so-called root register, which holds a pointer into a known location within the current Isolate.

Isolate layout

V8’s Isolate class contains the roots table, which itself contains pointers to root objects on the managed heap. The root register permanently holds the address of the roots table.

The new, isolate- and process-independent way to load a root object thus becomes:

// Load the constant address located at the given
// offset from roots.
REX.W movq rax,[kRootRegister + <offset>]

Root heap constants can be loaded directly from the roots list as above. Other heap constants use an additional indirection through a global builtins constant pool, itself stored on the roots list:

// Load the builtins constant pool, then the
// desired constant.
REX.W movq rax,[kRootRegister + <offset>]
REX.W movq rax,[rax + 0x1d7]

For Code targets, we initially switched to a more involved calling sequence which loads the target Code object from the global builtins constant pool as above, loads the target address into a register, and finally performs an indirect call.

With these changes, generated code became isolate- and process-independent and we could begin working on sharing it between processes.

概略翻译: 主要介绍了独立于 isolate 和 进程 之外的代码做了什么改动,如何改动的。

Sharing across processes

We initially evaluated two alternatives. Builtins could either be shared by mmap-ing a data blob file into memory; or, they could be embedded directly into the binary. We took the latter approach since it had the advantage that we would automatically reuse standard OS mechanisms to share memory across processes, and the change would not require additional logic by V8 embedders such as Chrome. We were confident in this approach since Dart’s AOT compilation had already successfully binary-embedded generated code.

An executable binary file is split into several sections. For example, an ELF binary contains data in the .data (initialized data), .ro_data (initialized read-only data), and .bss (uninitialized data) sections, while native executable code is placed in .text. Our goal was to pack the builtins code into the .text section alongside native code.


Sections of an executable binary file

This was done by introducing a new build step that used V8’s internal compiler pipeline to generate native code for all builtins and output their contents in embedded.cc. This file is then compiled into the final V8 binary.

The (simplified) V8 embedded build process

The embedded.cc file itself contains both metadata and generated builtins machine code as a series of .byte directives that instruct the C++ compiler (in our case, clang or gcc) to place the specified byte sequence directly into the output object file (and later the executable).

// Information about embedded builtins are included in
// a metadata table.
V8_EMBEDDED_TEXT_HEADER(v8_Default_embedded_blob_)
__asm__(“.byte 0x65,0x6d,0xcd,0x37,0xa8,0x1b,0x25,0x7e\n”
[snip metadata]

// Followed by the generated machine code.
__asm__(V8_ASM_LABEL(“Builtins_RecordWrite”));
__asm__(“.byte 0x55,0x48,0x89,0xe5,0x6a,0x18,0x48,0x83\n”
[snip builtins code]

Contents of the .text section are mapped into read-only executable memory at runtime, and the operating system will share memory across processes as long as it contains only position-independent code without relocatable symbols. This is exactly what we wanted.

But V8’s Code objects consist not only of the instruction stream, but also have various pieces of (sometimes isolate-dependent) metadata. Normal run-of-the-mill Code objects pack both metadata and the instruction stream into a variable-sized Code object that is located on the managed heap.

On-heap Code object layout

As we’ve seen, embedded builtins have their native instruction stream located outside the managed heap, embedded into the .text section. To preserve their metadata, each embedded builtin also has a small associated Code object on the managed heap, called the off-heap trampoline. Metadata is stored on the trampoline as for standard Code objects, while the inlined instruction stream simply contains a short sequence which loads the address of the embedded instructions and jumps there.

Off-heap Code object layout

The trampoline allows V8 to handle all Code objects uniformly. For most purposes, it is irrelevant whether the given Code object refers to standard code on the managed heap or to an embedded builtin.

概略翻译: 介绍了内存跨进程共享设计方面的内容。最后选择的方案是外部的可执行二进制文件的方式来存储 builtin。

Optimizing for performance

With the solution described in previous sections, embedded builtins were essentially feature-complete, but benchmarks showed that they came with significant slowdowns. For instance, our initial solution regressed Speedometer 2.0 by more than 5% overall.

We began to hunt for optimization opportunities, and identified major sources of slowdowns. The generated code was slower due to frequent indirections taken to access isolate- and process-dependent objects. Root constants were loaded from the root list (1 indirection), other heap constants from the global builtins constant pool (2 indirections), and external references additionally had to be unpacked from within a heap object (3 indirections). The worst offender was our new calling sequence, which had to load the trampoline Code object, call it, only to then jump to the target address. Finally, it appears that calls between the managed heap and binary-embedded code were inherently slower, possibly due to the long jump distance interfering with the CPU’s branch prediction.

Our work thus concentrated on 1. reducing indirections, and 2. improving the builtin calling sequence. To address the former, we altered the Isolate object layout to turn most object loads into a single root-relative load. The global builtins constant pool still exists, but only contains infrequently-accessed objects.


Optimized Isolate layout

Calling sequences were significantly improved on two fronts. Builtin-to-builtin calls were converted into a single pc-relative call instruction. This was not possible for runtime-generated JIT code since the pc-relative offset could exceed the maximal 32-bit value. There, we inlined the off-heap trampoline into all call sites, reducing the calling sequence from 6 to just 2 instructions.

With these optimizations, we were able to limit regressions on Speedometer 2.0 to roughly 0.5%.

概略翻译: 在研发完成之后,性能测试显示有比较明显的性能降低,Speedometer 2.0 来说是是 5%。最终优化结果 Speedometer 2.0 性能降低为 0.5%。

Results

We evaluated the impact of embedded builtins on x64 over the top 10k most popular websites, and compared against both lazy- and eager deserialization (described above).

V8 heap size reduction vs. eager and lazy deserialization

Whereas previously Chrome would ship with a memory-mapped snapshot that we’d deserialize on each Isolate, now the snapshot is replaced by embedded builtins that are still memory mapped but do not need to be deserialized. The cost for builtins used to be c*(1 + n) where n is the number of Isolates and c the memory cost of all builtins, whereas now it’s just c * 1 (in practice, a small amount of per-Isolate overhead also remains for off heap trampolines).

Compared against eager deserialization, we reduced the median V8 heap size by 19%. The median Chrome renderer process size per site has decreased by 4%. In absolute numbers, the 50th percentile saves 1.9 MB, the 30th percentile saves 3.4 MB, and the 10th percentile saves 6.5 MB per site.

Significant additional memory savings are expected once bytecode handlers are also binary-embedded.

Embedded builtins are rolling out on x64 in Chrome 69, and mobile platforms will follow in Chrome 70. Support for ia32 is expected to be released in late 2018.

Note: All diagrams were generated using Vyacheslav Egorov’s awesome Shaky Diagramming tool.

概略翻译: 内存消耗量优化结果如图:黑色的柱子是优化完毕之后和之前主动加载所有 builtin 的差距;灰色的柱子是优化完毕之后和之前惰式加载所有 builtin 的差距。与主动加载相比,V8 堆内存大小的中值降低了 19%。渲染器进程的中值降低了 4%。以数值来说,50% 的情况节约了 1.9 MB,30% 的情况节约了 3.4 MB,10% 的情况节约了 6.5 MB。Embedded builtins 将会在 Chrome 69 上线。

EOF