NeverSawUs

On the Performance of Closures in V8

trevnorris recently wrote a great post titled “Long live the callbacks” in which, among other subjects, he compares two hypothetical Point classes, each exposing a distance method, that exhibit drastically different performance characteristics. Paraphrasing, the recommendation from this portion of his post was to avoid creating functions that rely on variables from an enclosing scope in performance-critical paths. When I first read trevnorris's conclusions about the Point benchmark, I was skeptical.

Could the performance difference really be explained satisfactorily by the mere presence of variable lookups into another scope? And if so, why?

I resolved to find a root cause (and hopefully, in so doing, reveal that closures were not horrifyingly bad for performance). This post roughly chronicles my progress in finding such an explanation.

TL;DR, the thrilling conclusion.

The code from this post.

The Benchmark

// ./closure.js
// the closure utilizing version
module.exports = Point

function Point(x0, y0, x1, y1) {
  this.distance = function dist() {
    var x = x1 - x0

    var y = y1 - y0

    return Math.sqrt(x * x + y * y)
  }
}

// ./bench.js
var Point = require('./' + (process.env.WHICH || 'closure'))
var iter = 1000000
var rand = Math.random
for (var i = 0; i < iter; ++i) {
  var p = new Point(rand(), rand(), rand(), rand())
  for (var j = 0; j < 1000; ++j) {
    p.distance()
  }
}

The first thing that jumped out at me from reading trevnorris's post was the following observation:

... so it took 33 seconds and used right around 50MB memory. Let's make a slight adjustment to the implementation:

(What follows is a slightly modified version of his original refinement.)

// ./proto.js
// the prototype-utilizing version
module.exports = Point

function Point(x0, y0, x1, y1) {
  this.x0 = x0
  this.y0 = y0
  this.x1 = x1
  this.y1 = y1
}

Point.prototype.distance = function() {
  var x = this.x1 - this.x0
    , y = this.y1 - this.y0

  return Math.sqrt(x * x + y * y)
}

... Now it runs in under 2 seconds and only uses 15MB memory.

My first theory was that the primary difference between the two classes was that the original, closure'd version would create a function instance on every instantiation in addition to the Point instance itself. In my experience, each function instance that V8 encounters is treated as a totally distinct function by the optimizer.

In other words optimizations apply to function instances, not function definitions; and every thousand times through this loop we would create a new Point instance with a new point.distance function instance and have to re-warm the JIT against this new function. Given the relatively small inner loop count compared to the outer loop count, this seemed like a potentially viable explanation for the slowdown.
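This is observable from plain JavaScript. The following sketch of my own (reusing the closure-based Point from above) shows that every `new Point` re-evaluates the function expression, yielding a brand-new function object:

```javascript
// Sketch: each `new Point` re-evaluates the function expression,
// producing a distinct function object that the optimizer treats
// separately from its siblings.
function Point(x0, y0, x1, y1) {
  this.distance = function dist() {
    var x = x1 - x0
    var y = y1 - y0
    return Math.sqrt(x * x + y * y)
  }
}

var p1 = new Point(0, 0, 3, 4)
var p2 = new Point(0, 0, 3, 4)

console.log(p1.distance === p2.distance) // false: two distinct function objects
console.log(p1.distance())               // 5
```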


But what about the memory?

Given that we were creating one million Point instances in our benchmark, I reasoned that the overhead of creating an extra function instance (for .distance) per iteration could eventually affect that sort of memory behavior — especially if it was eagerly materializing .distance's .prototype object. This felt tenuous, but it was testable separately from the first explanation — we'd simply compare --trace_gc as the two versions ran, and see if one collected (and thus generated) more garbage than the other.
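One consequence is observable from plain JavaScript, independent of engine internals (whether V8 materializes it eagerly or lazily is an engine detail): every function object carries its own distinct .prototype object, so allocating a function per Point implies at least one extra object per Point, whether or not .distance is ever used as a constructor. A quick sketch of my own:

```javascript
// Sketch: two closures produced by the same definition are distinct
// objects, and each carries its own .prototype object.
function makeDist(x0, y0, x1, y1) {
  return function dist() {
    var x = x1 - x0
    var y = y1 - y0
    return Math.sqrt(x * x + y * y)
  }
}

var a = makeDist(0, 0, 3, 4)
var b = makeDist(0, 0, 3, 4)

console.log(a === b)                     // false
console.log(a.prototype === b.prototype) // false: one .prototype apiece
```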

My Hypothesis

If the object instantiation from the outer loop is removed so that the benchmark tests only the .distance function, then the performance gap between the two approaches should shrink considerably.

To test this hypothesis, I modified ./bench.js so that it only ever creates one Point instance: that way it's only testing the speed of the .distance function itself. My assumption was that the performance gap would shrink considerably, and that the memory usage would go down drastically. I used process.hrtime for the former, and looked at GC behavior for the latter (since I couldn't replicate the memory usage locally using process.memoryUsage).

// ./bench.js
var Point = require('./' + (process.env.WHICH || 'closure'))
var iter = 1000000 * 1000
var rand = Math.random
var hrtime = process.hrtime() // take the initial measurement
var p = new Point(rand(), rand(), rand(), rand())
for (var i = 0; i < iter; ++i) {
  p.distance()
}
hrtime = process.hrtime(hrtime) // get the difference since original measurement

console.log(
    '%sms'
  , hrtime[0] * 1e3 + hrtime[1] / 1e6
)
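As an aside, process.hrtime returns a [seconds, nanoseconds] tuple, so the expression in the console.log above multiplies seconds by 1e3 and divides nanoseconds by 1e6 to land on milliseconds. That conversion can be checked in isolation (toMs is a helper name of my own, not part of the benchmark):

```javascript
// process.hrtime deltas are [seconds, nanoseconds] tuples;
// milliseconds = seconds * 1e3 + nanoseconds / 1e6
function toMs(hrtime) {
  return hrtime[0] * 1e3 + hrtime[1] / 1e6
}

console.log(toMs([1, 500000000])) // 1500: one and a half seconds
console.log(toMs([0, 250000]))    // 0.25
```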

I got the following results:

$ time WHICH=proto node bench.js 
883.941066ms

real    0m0.943s
user    0m0.930s
sys 0m0.016s

$ time WHICH=closure node bench.js 
12233.704858ms

real    0m12.291s
user    0m12.295s
sys     0m0.059s

Hypothesis: Disproven

It's not the allocation!

The closure was still really slow. I checked the GC behavior next. Having deleted any explicit allocation from the benchmark, I fully expected the garbage collector to be run infrequently on both closure and proto. How wrong I was:

$ time WHICH=proto node --trace_gc bench
[34339]       34 ms: Scavenge 2.0 (35.0) -> 1.6 (36.0) MB, 0 ms [Runtime::PerformGC].
875.469205ms

real    0m0.935s
user    0m0.922s
sys     0m0.016s

$ time WHICH=closure node --trace_gc bench.js
[34421]       32 ms: Scavenge 2.0 (35.0) -> 1.6 (36.0) MB, 0 ms [Runtime::PerformGC].
[34421]       34 ms: Scavenge 2.3 (36.0) -> 1.6 (37.0) MB, 0 ms [Runtime::PerformGC].
[34421]       35 ms: Scavenge 2.6 (37.0) -> 1.6 (37.0) MB, 0 ms [Runtime::PerformGC].
[34421]       35 ms: Scavenge 2.6 (37.0) -> 1.6 (37.0) MB, 0 ms [Runtime::PerformGC].
[34421]       36 ms: Scavenge 2.6 (37.0) -> 1.6 (37.0) MB, 0 ms [Runtime::PerformGC].
[34421]       37 ms: Scavenge 2.6 (37.0) -> 1.6 (37.0) MB, 0 ms [Runtime::PerformGC].
# ... snip snip snip ...
[34421]    12076 ms: Scavenge 2.6 (37.0) -> 1.6 (37.0) MB, 0 ms [Runtime::PerformGC].
[34421]    12077 ms: Scavenge 2.6 (37.0) -> 1.6 (37.0) MB, 0 ms [Runtime::PerformGC].
[34421]    12078 ms: Scavenge 2.6 (37.0) -> 1.6 (37.0) MB, 0 ms [Runtime::PerformGC].
[34421]    12079 ms: Scavenge 2.6 (37.0) -> 1.6 (37.0) MB, 0 ms [Runtime::PerformGC].
[34421]    12080 ms: Scavenge 2.6 (37.0) -> 1.6 (37.0) MB, 0 ms [Runtime::PerformGC].
[34421]    12080 ms: Scavenge 2.6 (37.0) -> 1.6 (37.0) MB, 0 ms [Runtime::PerformGC].
[34421]    12081 ms: Scavenge 2.6 (37.0) -> 1.6 (37.0) MB, 0 ms [Runtime::PerformGC].

real    0m12.116s
user    0m12.128s
sys     0m0.067s

The closure version was really putting the GC through its paces, while the prototype version barely generated any garbage, which seems impossible at first glance. After all, with the explicit allocation gone, the only things left to generate garbage are the function calls themselves (activation records, the arguments object, and so on).

This is about the point where I slapped my forehead in realization: The only way to eliminate that allocation is to eliminate the function call. V8 was inlining .distance in the prototype version, but not in the closure version.

$ time WHICH=closure node --trace_inlining bench 
Inlining builtin 0x13c86dd1c901 <JS Function sqrt>
Did not inline require called from  (target requires context change).
Did not inline random called from  (target not inlineable).
Did not inline random called from  (target not inlineable).
Did not inline random called from  (target not inlineable).
Did not inline random called from  (target not inlineable).
Did not inline Point called from  (target contains unsupported syntax [early]).
Did not inline dist called from  (target requires context change).
12064.476165ms

real    0m12.122s
user    0m12.122s
sys     0m0.061s

$ time WHICH=proto node --trace_inlining bench 
Inlining builtin 0x3e56e111c901 <JS Function sqrt>
Did not inline require called from  (target requires context change).
Did not inline random called from  (target not inlineable).
Did not inline random called from  (target not inlineable).
Did not inline random called from  (target not inlineable).
Did not inline random called from  (target not inlineable).
Inlined Point called from .
Inlining builtin 0x3e56e111c901 <JS Function sqrt>
Inlined Point.distance called from .
889.616314ms

real    0m0.947s
user    0m0.935s
sys     0m0.016s

Sure enough, V8 was able to inline the prototype version of .distance, but not the closure version. A quick grep through the v8 codebase led me to the culprit:

  // Target must be able to use caller's context.
  CompilationInfo* outer_info = current_info();
  if (target->context() != outer_info->closure()->context() ||
      outer_info->scope()->contains_with() ||
      outer_info->scope()->num_heap_slots() > 0) {
    TraceInline(target, caller, "target requires context change");
    return false;
  }

In particular, we're concerned with the first check, which compares the context of the function-to-be-inlined against the context of the function-to-accept-the-inline. This check prevents any function that closes over variables from an outer scope from being inlined at a call site that doesn't share that scope.

Armed with this newfound understanding, I set out to write a hybrid approach that would use the "instance method" approach from closure, with the "this as context" from proto:

// hybrid.js:
module.exports = Point

function Point(x0, y0, x1, y1) {
  this.x0 = x0
  this.x1 = x1
  this.y0 = y0
  this.y1 = y1

  // an instancemethod that looks up the
  // properties needed from `this` instead
  // of from scope.
  this.distance = function distance() {
    var x = this.x1 - this.x0
      , y = this.y1 - this.y0

    return Math.sqrt(x * x + y * y)
  }
}

$ WHICH=hybrid node --trace_inlining bench.js
Inlining builtin 0x27153e11c901 <JS Function sqrt>
Did not inline require called from  (target requires context change).
Did not inline random called from  (target not inlineable).
Did not inline random called from  (target not inlineable).
Did not inline random called from  (target not inlineable).
Did not inline random called from  (target not inlineable).
Did not inline Point called from  (target contains unsupported syntax [early]).
Inlining builtin 0x27153e11c901 <JS Function sqrt>
Inlined distance called from .
888.523968ms

The hybrid distance is inlined! This confused me for a second: distance clearly has x0, y0, x1, and y1 in scope, but somehow those don't fail the context comparison in V8.

A while back, I helped a co-worker debug a frontend JS issue with the Chrome debugger. They had set a breakpoint in a function and were trying to reference a variable from an enclosing scope that the function itself never used, but they kept getting ReferenceErrors. It turned out that since the variable wasn't used in the function, it had been omitted from that function's context, and any attempt (from the debugger) to reach the value stored in that variable triggered an error. That was a fun lesson, and I thought back to it while pondering why the hybrid function might be inline-able despite the presence of an obviously different scope chain.

It seems the only context the function-to-be-inlined is allowed to keep is its arguments, which are supplied directly by the calling function-to-accept-the-inline. Interestingly, the this value doesn't count against the context comparison either, because it's passed as an argument internally (which makes sense, as it's determined at the call site).
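That suggests one more inline-friendly variant, a sketch of my own rather than something from trevnorris's post: hoist the arithmetic into a standalone helper and pass the coordinates in as arguments, so the helper needs nothing from any enclosing scope. (Whether V8 actually inlines this particular shape is worth verifying with --trace_inlining.)

```javascript
// Sketch: `distance` receives everything it needs as arguments,
// so it closes over nothing and needs no context change.
function distance(x0, y0, x1, y1) {
  var x = x1 - x0
  var y = y1 - y0
  return Math.sqrt(x * x + y * y)
}

function Point(x0, y0, x1, y1) {
  this.x0 = x0
  this.y0 = y0
  this.x1 = x1
  this.y1 = y1
}

Point.prototype.distance = function () {
  return distance(this.x0, this.y0, this.x1, this.y1)
}

console.log(new Point(0, 0, 3, 4).distance()) // 5
```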

Lessons learned

And an admonition

  • The admonition: microbenchmarks are a poor indicator of your app's overall performance. Benchmark your app, find the bottlenecks, find a more appropriate algorithm — and if one doesn't exist, then start worrying about the inlining behavior of your code.

  • If you're writing performance critical code, be keenly aware of the inlining behavior of your code. Run your code with --trace_inlining to see what's getting inlined, and what isn't.

  • If something isn't getting inlined, check to see that it doesn't close over any outer variables that the loop doesn't have access to.

  • Failing that, don't be afraid to clone the v8 codebase and search for the appropriate "reason" string you obtain from --trace_inlining.