One of the performance difficulties of interpreting a vector language is the accumulation of temporaries. Consider:
y = 7 * x + 13
There are two different vector arithmetic operations here; an eager evaluation will perform the multiplication and then the addition. We can visualize this using the assembly language from Empirical’s Vector Virtual Machine:
mul_i64v_i64v 7 %1 %2 ; "x" is %1
add_i64v_i64v %2 13 %3 ; "y" is %3
The VVM interpreter will execute each instruction using the equivalent C++ code:
std::vector<int64_t> temp(x.size());
for (size_t i = 0; i < x.size(); i++) {
temp[i] = 7 * x[i];
}
y.resize(temp.size());
for (size_t i = 0; i < temp.size(); i++) {
y[i] = temp[i] + 13;
}
The %2 register is a temporary vector that requires its own memory allocation. It would be much more performant if we could perform both operations in a single loop:
y.resize(x.size());
for (size_t i = 0; i < x.size(); i++) {
int64_t temp = 7 * x[i];
y[i] = temp + 13;
}
Instead of interpreting, we could compile the above assembly, looking ahead through the opcodes to decide which operations to fuse together (a toy version of that look-ahead pass appears below). Compilation carries a significant startup cost, though.
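To make that concrete, here is a sketch of such a look-ahead pass (my own illustration; the instruction record below is invented for clarity and is not VVM's real encoding). Two instructions can fuse when the first writes a register that only the second reads:

#include <cstdio>
#include <vector>

struct Insn { int in_reg; int out_reg; };  // simplified stand-in for a VVM op

// For each instruction, decide whether it can fuse with its successor.
std::vector<bool> mark_fusable(const std::vector<Insn>& prog) {
  std::vector<bool> fuse(prog.size(), false);
  for (size_t i = 0; i + 1 < prog.size(); i++) {
    // No producer/consumer link means no fusion.
    if (prog[i].out_reg != prog[i + 1].in_reg) continue;
    // If any later instruction also reads the output, it must stay in memory.
    bool used_later = false;
    for (size_t j = i + 2; j < prog.size(); j++)
      if (prog[j].in_reg == prog[i].out_reg) used_later = true;
    if (!used_later) fuse[i] = true;  // output is a pure temporary
  }
  return fuse;
}

int main() {
  // mul_i64v_i64v 7 %1 %2  ;  add_i64v_i64v %2 13 %3
  std::vector<Insn> prog = {{1, 2}, {2, 3}};
  std::printf("fuse mul into add: %d\n", (int)mark_fusable(prog)[0]);  // prints 1
}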
We can hide some of that startup cost by interpreting in the foreground while compiling in the background, then switching to the compiled version midway through execution. This is known as a tiered JIT and is a common technique in production-quality VMs.
There is a third approach, though.
Given the assembly for scalar instructions, we know ahead of time what the machine code should be; we just don’t know which CPU registers to use yet. We can employ copy-and-patch compilation to pre-build the machine code for each opcode into a stencil. Then at runtime, we only need to patch in the register or immediate values to emit the full machine code, which makes the compiler’s startup cost almost negligible.
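To illustrate, here is a minimal copy-and-patch sketch of my own (not CPython's or VVM's implementation). It assumes an x86-64 machine with POSIX mmap and permission to map a writable-executable page, which strict W^X systems forbid. The stencil is pre-assembled machine code for "mov rax, <imm64>; ret", and patching is a single memcpy into its hole:

#include <cstdint>
#include <cstdio>
#include <cstring>
#include <sys/mman.h>

int main() {
  // Stencil: 48 B8 <imm64> C3  ==  mov rax, imm64; ret
  // The 8-byte hole for the immediate starts at offset 2.
  uint8_t stencil[] = {0x48, 0xB8, 0, 0, 0, 0, 0, 0, 0, 0, 0xC3};
  const size_t hole = 2;

  // Allocate one writable+executable page.
  void* page = mmap(nullptr, 4096, PROT_READ | PROT_WRITE | PROT_EXEC,
                    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  if (page == MAP_FAILED) return 1;

  // Copy the stencil, then patch the runtime-chosen value into the hole.
  std::memcpy(page, stencil, sizeof(stencil));
  int64_t value = 42;
  std::memcpy(static_cast<uint8_t*>(page) + hole, &value, sizeof(value));

  // Call the freshly emitted machine code.
  auto fn = reinterpret_cast<int64_t (*)()>(page);
  std::printf("%lld\n", static_cast<long long>(fn()));  // prints 42
  return munmap(page, 4096);
}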
CPython adopted this technique for the experimental JIT introduced in version 3.13.
I’ve only seen this trick applied to scalar operations. My proposal is to do something similar, but with the stencils determined in advance for vector operations. We then fuse instructions simply by knowing where to start and stop the loop. I refer to this as copy-and-fuse compilation.
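Here is a conceptual sketch of what that fusion could look like (my reading of the idea, with C++ lambdas standing in for pre-built machine-code stencils): each vector opcode contributes a per-element body, and fusing amounts to running every body inside one shared loop, so the intermediate value never touches memory:

#include <cstdint>
#include <cstdio>
#include <functional>
#include <vector>

using Body = std::function<int64_t(int64_t)>;  // one element in, one out

// Pre-built "stencils" for the two opcodes in the example program.
Body mul_imm(int64_t k) { return [k](int64_t v) { return k * v; }; }
Body add_imm(int64_t k) { return [k](int64_t v) { return v + k; }; }

// Fused execution: one loop, no vector temporaries. In the real scheme the
// bodies would be machine-code stencils emitted back-to-back inside a single
// pre-built loop skeleton, with only the loop bounds chosen at runtime.
std::vector<int64_t> run_fused(const std::vector<Body>& bodies,
                               const std::vector<int64_t>& x) {
  std::vector<int64_t> y(x.size());
  for (size_t i = 0; i < x.size(); i++) {
    int64_t v = x[i];
    for (const Body& b : bodies) v = b(v);  // %2 stays in a register
    y[i] = v;
  }
  return y;
}

int main() {
  std::vector<int64_t> x = {1, 2, 3};
  auto y = run_fused({mul_imm(7), add_imm(13)}, x);  // y = 7 * x + 13
  for (int64_t v : y) std::printf("%lld ", static_cast<long long>(v));  // 20 27 34
}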
The result would make a good baseline compiler, and it would use less memory than an interpreter since the fused loops avoid vector temporaries. It won’t be as fast as an optimizing compiler, but the startup cost will be far less.
I don’t have time to implement this myself, so I gladly offer the idea to anyone who wants to try it.