It’s hard to estimate the exact impact of this cache unfriendliness on memory. Still, now that we have our new CFG compiler, we can compare the number of cache misses between the two: Sea of Nodes suffers on average from about 3 times more L1 dcache misses compared to our new CFG IR, and up to 7 times more in some phases. We estimate that this costs up to 5% of compile time, although this number is a bit handwavy. Still, keep in mind that in a JIT compiler, compiling fast is essential.
## Control-flow dependent typing is limited
Let’s consider the following JavaScript function:
```javascript
function foo(x) {
  if (x < 42) {
    return x + 1;
  }
  return x;
}
```
If so far we’ve only seen small integers for `x` and for the result of `x+1` (where “small integers” are 31-bit integers, cf. [Value tagging in V8](https://v8.dev/blog/pointer-compression#value-tagging-in-v8)), then we’ll speculate that this will remain the case. If we ever see `x` being larger than a 31-bit integer, then we will deoptimize. Similarly, if `x+1` produces a result that doesn’t fit in 31 bits, we will also deoptimize. This means that we need to check whether the result of `x+1` fits in 31 bits. Let’s have a look at the corresponding CFG and SoN graphs (assuming a `CheckedAdd` operation that adds its inputs and deoptimizes if the result overflows 31 bits):
With a CFG, it’s easy to realize that when `CheckedAdd(v1, 1)` is executed, `v1` is guaranteed to be less than `42`, and that there is therefore no need to check for 31-bit overflow. We would thus easily replace the `CheckedAdd` by a regular `Add`, which would execute faster, and would not require a deoptimization state (which is otherwise required to know how to resume execution after deoptimizing).
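To make this concrete, here is a minimal sketch of this kind of control-flow dependent reasoning on a CFG. The block representation, the range refinement, and the `lowerCheckedAdds` helper are invented for illustration and are not Turbofan’s actual data structures:

```javascript
// Largest value that fits in a 31-bit signed small integer (Smi).
const SMI_MAX = 2 ** 30 - 1;

// Hypothetical then-block of `foo`'s CFG: the dominating branch `x < 42`
// has refined the range of x.
const thenBlock = {
  refinedRanges: new Map([['x', { min: -(2 ** 30), max: 41 }]]),
  ops: [{ kind: 'CheckedAdd', input: 'x', constant: 1, output: 'v2' }],
};

function lowerCheckedAdds(block) {
  for (const op of block.ops) {
    if (op.kind !== 'CheckedAdd') continue;
    const range = block.refinedRanges.get(op.input);
    // If even the largest possible result fits in 31 bits, the overflow
    // check (and its deoptimization state) is unnecessary.
    if (range && range.max + op.constant <= SMI_MAX) {
      op.kind = 'Add';
    }
  }
}

lowerCheckedAdds(thenBlock);
console.log(thenBlock.ops[0].kind); // 'Add': the check is gone
```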
However, with a SoN graph, `CheckedAdd`, being a pure operation, will flow freely in the graph, and there is thus no way to remove the check until we’ve computed a schedule and decided that we will compute it after the branch (and at this point, we are back to a CFG, so this is not a SoN optimization anymore).
Such checked operations are frequent in V8 due to this 31-bit small integer optimization, and the ability to replace checked operations with unchecked ones can have a significant impact on the quality of the code generated by Turbofan. So, Turbofan’s SoN [puts a control-input on `CheckedAdd`](https://source.chromium.org/chromium/chromium/src/+/main:v8/src/compiler/simplified-operator.cc;l=966;drc=0a1fae9e77c6d8e85d8197b4f4396815ec9194b9), which can enable this optimization, but also means introducing a scheduling constraint on a pure node, a.k.a. going back to a CFG.
## And many other issues…
**Propagating deadness is hard.** Frequently, during some lowering, we realize that the current node is actually unreachable. In a CFG, we could just cut the current basic block here, and the following blocks would automatically become obviously unreachable, since they would have no predecessors anymore. In Sea of Nodes, it’s harder, because one has to patch both the control and effect chain. So, when a node on the effect chain is dead, we have to walk the effect chain forward until the next merge, killing everything along the way, and carefully handling nodes that are on the control chain.
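As a rough illustration, here is what that forward walk could look like. The node objects and helpers are invented for this sketch, and it only models the effect chain, not the control-chain patching that makes the real thing trickier:

```javascript
function makeNode(kind) {
  return { kind, dead: false, effectInputs: [], effectUses: [] };
}
function chainEffect(from, to) {
  from.effectUses.push(to);
  to.effectInputs.push(from);
}

// Walk the effect chain forward from a node found to be unreachable,
// killing nodes until the next effect merge (EffectPhi), whose other
// inputs may still be alive.
function propagateDeadness(start) {
  const worklist = [start];
  while (worklist.length > 0) {
    const node = worklist.pop();
    if (node.dead) continue;
    if (node.kind === 'EffectPhi') {
      // The merge survives; just drop the now-dead incoming effect edges.
      node.effectInputs = node.effectInputs.filter((input) => !input.dead);
      continue;
    }
    node.dead = true;
    for (const use of node.effectUses) worklist.push(use);
  }
}

// Demo: an unreachable Store feeds a Load, which merges with a live Call.
const store = makeNode('Store');
const load = makeNode('Load');
const call = makeNode('Call');
const merge = makeNode('EffectPhi');
chainEffect(store, load);
chainEffect(load, merge);
chainEffect(call, merge);
propagateDeadness(store);
console.log(load.dead, merge.dead, merge.effectInputs.length); // true false 1
```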
**It’s hard to introduce new control flow.** Because control flow nodes have to be on the control chain, it’s not possible to introduce new control flow during regular lowerings. So, if there is a pure node in the graph, such as `Int32Max`, which returns the maximum of two integers, and which we would eventually like to lower to `if (x > y) { x } else { y }`, this is not easily doable in Sea of Nodes, because we would need a way to figure out where on the control chain to plug this subgraph. One way to implement this would be to put `Int32Max` on the control chain from the beginning, but this feels wasteful: the node is pure and should be allowed to move around freely. So, the canonical Sea of Nodes way to solve this, used both in Turbofan and by Cliff Click (Sea of Nodes’ inventor), as mentioned in this [Coffee Compiler Club](https://youtu.be/Vu372dnk2Ak?t=3037) chat, is to delay this kind of lowering until we have a schedule (and thus a CFG). As a result, we have a phase around the middle of the pipeline that computes a schedule and lowers the graph, where a lot of random optimizations are packed together because they all require a schedule. By comparison, with a CFG, we would be free to do these optimizations earlier or later in the pipeline.
Also, remember from the introduction that one of the issues of Crankshaft (Turbofan’s predecessor) was that it was virtually impossible to introduce control flow after having built the graph. Turbofan is a slight improvement over this, since lowering of nodes on the control chain can introduce new control flow, but this is still limited.
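For contrast, here is a toy version of the `Int32Max` lowering once a schedule exists. The CFG representation (blocks as arrays of operations) is invented for illustration, but it shows why having a concrete position for the node makes inserting the diamond straightforward:

```javascript
function newBlock(ops = []) {
  return { ops, successors: [] };
}

// Split the block at the Int32Max, then replace the pure operation with an
// explicit branch, two trivial arms, and a phi in the merge block.
function lowerInt32Max(cfg, block, index) {
  const op = block.ops[index]; // { kind: 'Int32Max', inputs: [x, y], output }
  const [x, y] = op.inputs;
  const merge = newBlock(block.ops.slice(index + 1));
  merge.ops.unshift({ kind: 'Phi', inputs: [x, y], output: op.output });
  merge.successors = block.successors;
  const thenBlock = newBlock([{ kind: 'Goto' }]);
  const elseBlock = newBlock([{ kind: 'Goto' }]);
  thenBlock.successors = [merge];
  elseBlock.successors = [merge];
  block.ops = [
    ...block.ops.slice(0, index),
    { kind: 'Branch', cond: { kind: 'Int32GreaterThan', inputs: [x, y] } },
  ];
  block.successors = [thenBlock, elseBlock];
  cfg.push(thenBlock, elseBlock, merge);
}

// Demo: a single block computing m = Int32Max(a, b) and returning it.
const entry = newBlock([
  { kind: 'Int32Max', inputs: ['a', 'b'], output: 'm' },
  { kind: 'Return', inputs: ['m'] },
]);
const cfg = [entry];
lowerInt32Max(cfg, entry, 0);
console.log(cfg.length, entry.successors.length); // 4 2
```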
**It’s hard to figure out what is inside of a loop.** Because a lot of nodes are floating outside of the control chain, it’s hard to figure out what is inside each loop. As a result, basic optimizations such as loop peeling and loop unrolling are hard to implement.
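By comparison, on a CFG this question has a standard textbook answer: the body of a natural loop can be collected by walking predecessors backwards from the back edge. A minimal sketch, with an invented block representation:

```javascript
// Given a back edge latch -> header, collect the natural loop's body by
// walking predecessor edges backwards from the latch until the header.
function naturalLoop(header, latch) {
  const body = new Set([header, latch]);
  const stack = [latch];
  while (stack.length > 0) {
    const block = stack.pop();
    if (block === header) continue;
    for (const pred of block.predecessors) {
      if (!body.has(pred)) {
        body.add(pred);
        stack.push(pred);
      }
    }
  }
  return body;
}

// Demo: entry -> header -> inner -> latch, plus a back edge latch -> header.
const entry = { predecessors: [] };
const header = { predecessors: [entry] };
const inner = { predecessors: [header] };
const latch = { predecessors: [inner] };
header.predecessors.push(latch); // the back edge
console.log(naturalLoop(header, latch).size); // 3: header, inner, latch
```

In Sea of Nodes, floating pure nodes have no predecessor structure to walk in this sense, so there is no equally direct way to ask which of them belong to the loop.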
**Compiling is slow.** This is a direct consequence of multiple issues that I’ve already mentioned: it’s hard to find a good visitation order for nodes, which leads to many useless revisitations; state tracking is expensive; memory usage is bad; cache locality is bad… This might not be a big deal for an ahead-of-time compiler, but in a JIT compiler, compiling slowly means that we keep executing slow unoptimized code until the optimized code is ready, while taking away resources from other tasks (e.g., other compilation jobs, or the Garbage Collector). One consequence of this is that we are forced to think very carefully about the compile time vs. speedup tradeoff of new optimizations, often erring towards the side of optimizing less to keep optimizing fast.
**Sea of Nodes destroys any prior scheduling, by construction.** JavaScript source code is typically not manually optimized with CPU microarchitecture in mind. However, WebAssembly code can be, either at the source level (C++ for instance), or by an [ahead-of-time (AOT)](https://en.wikipedia.org/wiki/Ahead-of-time_compilation) compilation toolchain (like [Binaryen/Emscripten](https://github.com/WebAssembly/binaryen)). As a result, WebAssembly code could be scheduled in a way that should be good on most architectures (for instance, reducing the need for [spilling](https://en.wikipedia.org/wiki/Register_allocation#Components_of_register_allocation), assuming 16 registers). However, SoN always discards the initial schedule and has to rely solely on its own scheduler, which, because of the time constraints of JIT compilation, can easily be worse than what an AOT compiler (or a C++ developer carefully thinking about the scheduling of their code) could do. We have seen cases where WebAssembly was suffering from this. And, unfortunately, using a CFG compiler for WebAssembly and a SoN compiler for JavaScript in Turbofan was not an option either, since using the same compiler for both enables inlining across both languages.