Re JIT vs interpreter loop - assuming a highly optimized interpreter, the dispatch loop itself is generally responsible for significantly less than 10% of the overhead, depending on IR granularity and where the rest of the overhead lies.
The bulk of the JIT performance win (assuming essentially a template JIT) comes from removing overheads like indirect loads, branches on instruction operands, etc. that the interpreter needs just to actually perform the instruction, e.g. consider this basic example from JSC:
$ jsc -d
>>> function f(a, b) { return a + b; }
<snip dump info for executing the above>
>>> f(1,2)
<snip dump of making the call>
f#EVgtVm:[0x1138543a0->0x113829900, NoneFunctionCall, 9]: 3 instructions (0 16-bit instructions, 0 32-bit instructions, 0 instructions with metadata); 9 bytes (0 metadata bytes); 3 parameter(s); 6 callee register(s); 5 variable(s); scope at loc4
bb#1
Predecessors: [ ]
[ 0] enter
[ 1] add dst:loc5, lhs:arg1, rhs:arg2, profileIndex:0, operandTypes:OperandTypes(126, 126)
[ 7] ret value:loc5
Successors: [ ]
For the interpreter this evaluation requires (see the C sketch after this list):
- load the index of the register for arg1 (indirect load through instruction pointer)
- load frame[arg1Idx] (unpredictable dependent load)
- load the index of the register for arg2 (indirect load through instruction pointer)
- load frame[arg2Idx] (unpredictable dependent load)
- perform the addition
- load the index of loc5 (indirect load through instruction pointer)
- store the result into frame[loc5Idx] (unpredictable dependent store, which may of course interfere with a subsequent instruction performing an unpredictable load of frame[loc5Idx])
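To make that concrete, here is a minimal C sketch of a generic interpreter add handler - illustrative only, not JSC's actual LLInt code; op_add, Value, frame, and the operand layout are all invented for the example:

#include <stdint.h>

typedef int64_t Value;          /* stand-in for a real tagged JS value */

/* ip points at the current add instruction's opcode byte;
   frame is the virtual register file for the current call frame. */
static const uint8_t *op_add(const uint8_t *ip, Value *frame) {
    uint8_t dstIdx = ip[1];     /* indirect load through the instruction pointer */
    uint8_t lhsIdx = ip[2];     /* indirect load through the instruction pointer */
    uint8_t rhsIdx = ip[3];     /* indirect load through the instruction pointer */
    Value lhs = frame[lhsIdx];  /* unpredictable dependent load */
    Value rhs = frame[rhsIdx];  /* unpredictable dependent load */
    frame[dstIdx] = lhs + rhs;  /* unpredictable dependent store */
    return ip + 4;              /* advance, then dispatch to the next handler */
}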
So in the interpreter we have a pile of indirect and dependent loads[1], and may have runtime branches determined by instruction parameters. In the jitted code we lose the reads through the instruction pointer, the register indices all become constants, and many/most branches determined by instruction parameters disappear because the behaviour is known at compilation time. Because the register stores/loads have constant offsets you also get load/store forwarding within and between instructions. All of these are responsible for a much larger proportion of the interpreter to template JIT performance win than the dispatch loop itself.
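For contrast, here is roughly what a template JIT emits for that same add, again sketched as C rather than actual generated machine code; the frame offsets are illustrative, not JSC's real frame layout:

#include <stdint.h>

typedef int64_t Value;

/* The register indices are known at compile time, so they become
   constant frame offsets: no reads through the instruction pointer,
   no dispatch, and the CPU can forward the store to frame[LOC5]
   into any immediately following load of it. */
enum { ARG1 = 2, ARG2 = 3, LOC5 = 9 };   /* illustrative offsets */

static void jitted_add(Value *frame) {
    frame[LOC5] = frame[ARG1] + frame[ARG2];
}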
Obviously once the optimizers kick in, the relationship between the interpreter instructions and the generated code disappears, so a performance comparison there isn't relevant (from the pov of things like interpreter loop overheads, etc).
[1] The add does type checks on entry, but will swap the generic add instruction for one that prioritizes the types seen, so that's generally well predicted. This happens more or less immediately, and before any kind of JIT compilation happens, so the bizarreness of JS add isn't super expensive.
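That swap is the classic bytecode-quickening trick; a hand-wavy C sketch of the idea, not JSC's actual mechanism (is_int32 and the tagging scheme are invented for the example, real engines use NaN-boxing or similar):

#include <stdint.h>
#include <stdbool.h>

enum { OP_ADD_GENERIC, OP_ADD_INT32 };   /* hypothetical opcodes */

/* Invented tag check, just for the sketch. */
static bool is_int32(int64_t v) { return (v >> 32) == 0; }

/* First execution: observe the operand types, rewrite the opcode in
   place, and take the fast path; later executions dispatch straight
   to the int32 handler, so the type branch predicts well. */
static void op_add_generic(uint8_t *ip, int64_t *frame) {
    int64_t lhs = frame[ip[2]], rhs = frame[ip[3]];
    if (is_int32(lhs) && is_int32(rhs)) {
        ip[0] = OP_ADD_INT32;            /* quicken for the types seen */
        frame[ip[1]] = lhs + rhs;
    } else {
        /* full generic path: doubles, strings, valueOf, ... */
    }
}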