1. Original article

Improving DataView performance in V8

2. Translated summary

DataViews are one of the two possible ways to do low-level memory accesses in JavaScript, the other one being TypedArrays. Up until now, DataViews were much less optimized than TypedArrays in V8, resulting in lower performance on tasks such as graphics-intensive workloads or decoding/encoding binary data. The reasons are mostly historical: asm.js chose TypedArrays instead of DataViews, so engines were incentivized to focus on TypedArray performance.

Because of the performance penalty, JavaScript developers such as the Google Maps team decided to avoid DataViews and rely on TypedArrays instead, at the cost of increased code complexity. This blog post explains how we brought DataView performance to match, and even surpass, equivalent TypedArray code in V8 v6.9, effectively making DataView usable for performance-critical real-world applications.

Background

Since the introduction of ES2015, JavaScript has supported reading and writing data in raw binary buffers called ArrayBuffers. ArrayBuffers cannot be accessed directly; rather, programs must use a so-called array buffer view object, which can be either a DataView or a TypedArray.
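As a small illustration of why a view object is needed, the sketch below indexes an ArrayBuffer directly and then reads the same bytes through two views. The variable names are chosen only for this example.

```javascript
// An ArrayBuffer only holds raw bytes; it has no indexed element access of its own.
const rawBuffer = new ArrayBuffer(4);
console.log(rawBuffer.byteLength); // 4
console.log(rawBuffer[0]);         // undefined: indexing the buffer itself reads nothing

// Two views over the same bytes: a write through one is visible through the other.
const bytes = new Uint8Array(rawBuffer);
const dataView = new DataView(rawBuffer);
bytes[0] = 0xff;
console.log(dataView.getUint8(0)); // 255
```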

TypedArrays allow programs to access the buffer as an array of uniformly typed values, such as an Int16Array or a Float32Array.

const buffer = new ArrayBuffer(32);
const array = new Int16Array(buffer);

for (let i = 0; i < array.length; i++) {
  array[i] = i * i;
}

console.log(array);
// → [0, 1, 4, 9, 16, 25, 36, 49, 64, 81, 100, 121, 144, 169, 196, 225]
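A TypedArray can also cover just part of a buffer, but its byte offset has to be aligned to the element size. The following is a minimal sketch of that constraint; the buffer layout and variable names are assumptions made up for the example.

```javascript
const mixedBuffer = new ArrayBuffer(16);

// TypedArray constructors accept (buffer, byteOffset, length).
const counters = new Uint16Array(mixedBuffer, 0, 2); // bytes 0..3
const samples = new Float32Array(mixedBuffer, 4, 3); // bytes 4..15

counters[1] = 3;
samples.set([0.5, 1.5, 2.5]);
console.log(counters.length, samples.length); // 2 3
console.log(samples[2]); // 2.5

// The byte offset must be a multiple of the element size (4 for Float32Array).
let alignmentError = null;
try {
  new Float32Array(mixedBuffer, 2);
} catch (error) {
  alignmentError = error;
}
console.log(alignmentError instanceof RangeError); // true
```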

On the other hand, DataViews allow for more fine-grained data access. They let the programmer choose the type of values read from and written to the buffer by providing specialized getters and setters for each number type, making them useful for serializing data structures.

const buffer = new ArrayBuffer(32);
const view = new DataView(buffer);

const person = { age: 42, height: 1.76 };

view.setUint8(0, person.age);
view.setFloat64(1, person.height);

console.log(view.getUint8(0)); // Expected output: 42
console.log(view.getFloat64(1)); // Expected output: 1.76
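The same idea can be wrapped into a pair of encode/decode helpers. This is only a sketch: the record layout, `RECORD_SIZE`, `encodePerson`, and `decodePerson` are hypothetical names introduced for illustration.

```javascript
// Hypothetical record layout (an assumption for this sketch):
//   byte 0     : age    (Uint8)
//   bytes 1..8 : height (Float64, little-endian)
const RECORD_SIZE = 9;

function encodePerson(person) {
  const recordView = new DataView(new ArrayBuffer(RECORD_SIZE));
  recordView.setUint8(0, person.age);
  recordView.setFloat64(1, person.height, true);
  return recordView.buffer;
}

function decodePerson(recordBuffer) {
  const recordView = new DataView(recordBuffer);
  return {
    age: recordView.getUint8(0),
    height: recordView.getFloat64(1, true),
  };
}

const decoded = decodePerson(encodePerson({ age: 42, height: 1.76 }));
console.log(decoded.age, decoded.height); // 42 1.76
```

Note that the Float64 field starts at byte offset 1, which a Float64Array could not address because of its alignment requirement.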

Moreover, DataViews also allow the choice of the endianness of the data storage, which can be useful when receiving data from external sources such as the network, a file, or a GPU.

const buffer = new ArrayBuffer(32);
const view = new DataView(buffer);

view.setInt32(0, 0x8BADF00D, true); // Little-endian write.
console.log(view.getInt32(0, false)); // Big-endian read.
// Expected output: 0x0DF0AD8B (233876875)
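For data arriving from the network, the explicit endianness flag means the decoding code does not depend on the host's byte order. The sketch below parses a made-up 6-byte header; the header layout is an assumption for illustration, not a real protocol.

```javascript
// Hypothetical 6-byte header in network (big-endian) byte order:
//   bytes 0..1 : message type   (Uint16)
//   bytes 2..5 : payload length (Uint32)
const headerBytes = new Uint8Array([0x00, 0x02, 0x00, 0x00, 0x01, 0x00]);
const headerView = new DataView(headerBytes.buffer);

// Omitting the littleEndian argument (or passing false) reads big-endian.
const messageType = headerView.getUint16(0);
const payloadLength = headerView.getUint32(2);
console.log(messageType, payloadLength); // 2 256

// The same two bytes read as little-endian give a different number.
console.log(headerView.getUint16(0, true)); // 512
```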

An efficient DataView implementation has been a feature request for a long time (see this bug report from over 5 years ago), and we are happy to announce that DataView performance is now on par with TypedArrays!

Legacy runtime implementation

Until recently, the DataView methods were implemented as built-in C++ runtime functions in V8. This is very costly, because each call requires an expensive transition from JavaScript to C++ (and back).

To investigate the actual performance cost incurred by this implementation, we set up a performance benchmark that compares the native DataView getter implementation with a JavaScript wrapper simulating DataView behavior. This wrapper uses a Uint8Array to read data byte by byte from the underlying buffer, and then computes the return value from those bytes. Here is, for example, the function for reading little-endian 32-bit unsigned integer values:

function LittleEndian(buffer) { // Simulate little-endian DataView reads.
  this.uint8View_ = new Uint8Array(buffer);
}

LittleEndian.prototype.getUint32 = function(byteOffset) {
  // Bitwise operators yield signed 32-bit integers, so `>>> 0` reinterprets
  // the combined value as unsigned.
  return (this.uint8View_[byteOffset] |
    (this.uint8View_[byteOffset + 1] << 8) |
    (this.uint8View_[byteOffset + 2] << 16) |
    (this.uint8View_[byteOffset + 3] << 24)) >>> 0;
};
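Since the benchmark covers big-endian reads as well, a big-endian counterpart of the wrapper can be sketched the same way. `BigEndian` is a hypothetical name modeled on the LittleEndian wrapper above, not code taken from the benchmark itself; the check at the end compares it against the native getter.

```javascript
// Hypothetical big-endian counterpart of the LittleEndian wrapper.
function BigEndian(buffer) {
  this.uint8View_ = new Uint8Array(buffer);
}

BigEndian.prototype.getUint32 = function(byteOffset) {
  // The first byte is the most significant one; `>>> 0` converts the signed
  // 32-bit result of the bitwise operators to an unsigned value.
  return ((this.uint8View_[byteOffset] << 24) |
    (this.uint8View_[byteOffset + 1] << 16) |
    (this.uint8View_[byteOffset + 2] << 8) |
    this.uint8View_[byteOffset + 3]) >>> 0;
};

const sampleBuffer = new Uint8Array([0xDE, 0xAD, 0xBE, 0xEF]).buffer;
const wrapperValue = new BigEndian(sampleBuffer).getUint32(0);
const nativeValue = new DataView(sampleBuffer).getUint32(0, false);
console.log(wrapperValue === nativeValue); // true
console.log(wrapperValue.toString(16));    // deadbeef
```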

TypedArrays are already heavily optimized in V8, so they represent the performance goal that we wanted to match.

Figure: Original DataView performance

Our benchmark shows that native DataView getter performance was as much as 4 times slower than the Uint8Array-based wrapper, for both big-endian and little-endian reads.
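A comparison of this kind can be approximated with a rough timing harness like the one below. This is not the benchmark used for the numbers above; the buffer size, iteration count, and function names are arbitrary assumptions, and the timings will vary by engine version and machine. The final check only confirms that both code paths read the same values.

```javascript
// Fill a small buffer with a repeating byte pattern.
const benchBuffer = new ArrayBuffer(1024);
const benchBytes = new Uint8Array(benchBuffer);
for (let i = 0; i < benchBytes.length; i++) benchBytes[i] = i & 0xff;
const benchView = new DataView(benchBuffer);

// Sum all little-endian Uint32 values using the native DataView getter.
function readWithDataView() {
  let sum = 0;
  for (let offset = 0; offset + 4 <= benchView.byteLength; offset += 4) {
    sum += benchView.getUint32(offset, true);
  }
  return sum;
}

// Sum the same values by assembling them byte by byte from a Uint8Array.
function readWithUint8Array() {
  let sum = 0;
  for (let offset = 0; offset + 4 <= benchBytes.length; offset += 4) {
    sum += (benchBytes[offset] |
      (benchBytes[offset + 1] << 8) |
      (benchBytes[offset + 2] << 16) |
      (benchBytes[offset + 3] << 24)) >>> 0;
  }
  return sum;
}

// Run a function repeatedly and print the elapsed wall-clock time.
function time(label, fn, iterations = 10000) {
  const start = Date.now();
  let result = 0;
  for (let i = 0; i < iterations; i++) result = fn();
  console.log(`${label}: ${Date.now() - start} ms`);
  return result;
}

const dataViewSum = time('DataView', readWithDataView);
const wrapperSum = time('Uint8Array wrapper', readWithUint8Array);
console.log(dataViewSum === wrapperSum); // true
```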

Improving baseline performance

Our first step in improving the performance of DataView objects was to move the implementation from the C++ runtime to CodeStubAssembler (also known as CSA). CSA is a portable assembly language that allows us to write code directly in TurboFan’s machine-level intermediate representation (IR), and we use it to implement optimized parts of V8’s JavaScript standard library. Rewriting code in CSA bypasses the call to C++ completely, and also generates efficient machine code by leveraging TurboFan’s backend.

However, writing CSA code by hand is cumbersome. Control flow in CSA is expressed much like in assembly, using explicit labels and gotos, which makes the code harder to read and understand at a glance.

To make it easier for developers to contribute to the optimized JavaScript standard library in V8, and to improve readability and maintainability, we started designing a new language called V8 Torque, which compiles down to CSA. The goal for Torque is to abstract away the low-level details that make CSA code harder to write and maintain, while retaining the same performance profile.

Rewriting the DataView code was an excellent opportunity to start using Torque for new code, and it gave the Torque developers a lot of feedback about the language. This is what the DataView’s getUint32() method looks like, written in Torque:

macro LoadDataViewUint32(buffer: JSArrayBuffer, offset: intptr,
                    requested_little_endian: bool,
                    signed: constexpr bool): Number {
  let data_pointer: RawPtr = buffer.backing_store;

  let b0: uint32 = LoadUint8(data_pointer, offset);
  let b1: uint32 = LoadUint8(data_pointer, offset + 1);
  let b2: uint32 = LoadUint8(data_pointer, offset + 2);
  let b3: uint32 = LoadUint8(data_pointer, offset + 3);
  let result: uint32;

  if (requested_little_endian) {
    result = (b3 << 24) | (b2 << 16) | (b1 << 8) | b0;
  } else {
    result = (b0 << 24) | (b1 << 16) | (b2 << 8) | b3;
  }

  return convert<Number>(result);
}

Moving the DataView methods to Torque already showed a 3× improvement in performance, but did not quite match the performance of the Uint8Array-based wrapper yet.

Figure: Torque DataView performance

Optimizing for TurboFan

When JavaScript code gets hot, we compile it using our TurboFan optimizing compiler, in order to generate highly-optimized machine code that runs more efficiently than interpreted bytecode.

TurboFan works by translating the incoming JavaScript code into an internal graph representation (more precisely, a “sea of nodes”). It starts with high-level nodes that match the JavaScript operations and semantics, and gradually refines them into lower and lower level nodes, until it finally generates machine code.

In particular, a function call, such as calling one of the DataView methods, is internally represented as a JSCall node, which eventually boils down to an actual function call in the generated machine code.

However, TurboFan allows us to check whether the JSCall node is actually a call to a known function, for example one of the builtin functions, and inline this node in the IR. This means that the complicated JSCall gets replaced at compile-time by a subgraph that represents the function. This allows TurboFan to optimize the inside of the function in subsequent passes as part of a broader context, instead of on its own, and most importantly to get rid of the costly function call.

Figure: Initial TurboFan DataView performance

Implementing TurboFan inlining finally allowed us to match, and even exceed, the performance of our Uint8Array wrapper, and be 8 times as fast as the former C++ implementation.

Further TurboFan optimizations

Looking at the machine code generated by TurboFan after inlining the DataView methods, there was still room for some improvement. The first implementation of those methods tried to follow the standard pretty closely, and threw errors when the spec indicates so (for example, when trying to read or write out of the bounds of the underlying ArrayBuffer).

However, the code that we write in TurboFan is meant to be optimized to be as fast as possible for the common, hot cases; it doesn’t need to support every possible edge case. By removing all the intricate handling of those errors, and just deoptimizing back to the baseline Torque implementation when we need to throw, we were able to reduce the size of the generated code by around 35%, generating a quite noticeable speedup, as well as considerably simpler TurboFan code.
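From the JavaScript side, the behavior stays the same whichever implementation ends up handling the call: an out-of-bounds access still throws a RangeError, as the spec requires. A minimal sketch of that rare case (variable names are made up for the example):

```javascript
const smallView = new DataView(new ArrayBuffer(8));

// In-bounds read: the common case that optimized code is specialized for.
console.log(smallView.getUint32(4)); // 0

// Reading 4 bytes at offset 6 would run past the 8-byte buffer,
// so the spec requires a RangeError.
let caught = null;
try {
  smallView.getUint32(6);
} catch (error) {
  caught = error;
}
console.log(caught instanceof RangeError); // true
```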

Following up on this idea of being as specialized as possible in TurboFan, we also removed support for indices or offsets that are too large (outside of Smi range) inside the TurboFan-optimized code. This allowed us to get rid of handling of the float64 arithmetic that is needed for offsets that do not fit into a 32-bit value, and to avoid storing large integers on the heap.

Compared to the initial TurboFan implementation, this more than doubled the DataView benchmark score. DataViews are now up to 3 times as fast as the Uint8Array wrapper, and around 16 times as fast as our original DataView implementation!

Figure: Final TurboFan DataView performance

Impact

In addition to our own benchmark, we’ve evaluated the performance impact of the new implementation on some real-world examples.

DataViews are often used when decoding data encoded in binary formats from JavaScript. One such binary format is FBX, a format that is used for exchanging 3D animations. We’ve instrumented the FBX loader of the popular three.js JavaScript 3D library, and measured a 10% (around 80 ms) reduction in its execution time.
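Binary decoders of this kind typically wrap a DataView in a small cursor-style reader. The sketch below shows the general pattern only; `BinaryReader` and its methods are hypothetical names for this example and are not taken from the three.js loader, and the payload layout is made up.

```javascript
// Hypothetical cursor-style reader that advances an offset after each read.
class BinaryReader {
  constructor(buffer, littleEndian = true) {
    this.view = new DataView(buffer);
    this.offset = 0;
    this.littleEndian = littleEndian;
  }
  readUint8() {
    const value = this.view.getUint8(this.offset);
    this.offset += 1;
    return value;
  }
  readUint32() {
    const value = this.view.getUint32(this.offset, this.littleEndian);
    this.offset += 4;
    return value;
  }
  readFloat32() {
    const value = this.view.getFloat32(this.offset, this.littleEndian);
    this.offset += 4;
    return value;
  }
}

// Build a small test payload: a version byte, a count, then that many floats.
const payload = new DataView(new ArrayBuffer(1 + 4 + 2 * 4));
payload.setUint8(0, 7);
payload.setUint32(1, 2, true);
payload.setFloat32(5, 0.5, true);
payload.setFloat32(9, -1.25, true);

const reader = new BinaryReader(payload.buffer);
const version = reader.readUint8();
const count = reader.readUint32();
const values = [];
for (let i = 0; i < count; i++) values.push(reader.readFloat32());
console.log(version, count, values.join(', ')); // 7 2 0.5, -1.25
```

Because the fields are packed without padding, the Uint32 and Float32 values sit at unaligned offsets, which is where DataView getters are more convenient than TypedArrays.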

We also compared the overall performance of DataViews against TypedArrays. We found that our new DataView implementation provides almost the same performance as TypedArrays when accessing data aligned in the native endianness (little-endian on Intel processors), bridging much of the performance gap and making DataViews a practical choice in V8.

Figure: DataView vs. TypedArray peak performance
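To make "aligned in the native endianness" concrete, the sketch below detects the platform's byte order and then reads the same aligned 32-bit value through a Uint32Array and through a DataView. The variable names are assumptions for this example; the equality holds on both little-endian and big-endian hosts because the DataView read is given the detected byte order.

```javascript
// Detect the platform's byte order by checking which byte of a 32-bit 1 comes first.
const isLittleEndian = new Uint8Array(new Uint32Array([1]).buffer)[0] === 1;

const sharedBuffer = new ArrayBuffer(16);
const words = new Uint32Array(sharedBuffer);
const wordView = new DataView(sharedBuffer);

// TypedArrays always store elements in the platform's native byte order.
words[2] = 0xCAFEBABE;

// Element index 2 of a Uint32Array starts at byte offset 2 * 4.
const viaDataView = wordView.getUint32(2 * Uint32Array.BYTES_PER_ELEMENT, isLittleEndian);
console.log(viaDataView === words[2]); // true
```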
