Firth.js and Macros for JavaScript

Yesterday, I got 3 free hours to work on personal projects, so I decided to port cFirth (a FirthVM written in C) to fairly efficient JavaScript. Since I wrote the original cFirth implementation on a train in about 6 hours, I figured 3 hours would be enough to port a 64 bit C program to a 32bit ArrayBuffer implementation in JavaScript. The real difference being that 6 packed instructions per instruction register fetch, instead of 12, 32 bit literal instructions, and the use of labeled nested while loops, instead of goto.

Structurally, I kept the two almost identical. Both have a main instruction fetch loop and an inner decode loop. Inside of the inner loop is a switch statement which does instruction dispatch. And then each instruction matches the Verilog definition as best can be done in JavaScript. The big differences lie in ArrayBuffer objects are used as main memory, the data stack, and the return stack.

One of the peculiar implementation details of ArrayBuffer objects, is that access of the buffer object is done with a typed proxy object. If you look at the source of an emscripten project you'll see an ArrayBuffer used as the heap, and a separate HEAP# variable for the typed array to access the main heap buffer by each of the 6 alignment + sign combinations for access. HEAP8 provides byte oriented access, whereas HEAPU32 provides unsigned long access to the same memory space. While this scheme works well for machine generated code, the human aspect is suspect. When I did my accessors, instead of separate variables, I attached the accessor proxy to the ArrayBuffer object itself as an additional property:

Memory = new ArrayBuffer(RAM)
Memory.char = new Int8Array(Memory)
Memory.word = new Int16Array(Memory)
Memory.int = new Int32Array(Memory)
Memory.byte = new Uint8Array(Memory)
Memory.short = new Uint16Array(Memory)
Memory.long = new Uint32Array(Memory)

This made the code easy to read and produced a nice programmer interface. Using this interface I produced a stack object to wrap an ArrayBuffer object for both the return and data stacks:

stack = function()<br>    var s = new ArrayBuffer(34*4);<br>    s.int = new Int32Array(s);<br>    s.long = new Uint32Array(s);<br>    s.tos = function() {<br>        if(arguments.length) this.int[this.long[32]&31] = arguments[0]<br>        return this.int[this.long[32]&31] <br>    ...

Using this stack interface object, I could then cleanly implement most of the instructions as method calls:

case 2: s.nos(s.nos() + s.tos()); s.pop(); break;

This sort of straight forward code was a direct translation of the behavior of the stack machine. And the primary reason for this was during the port, I had to translate what the C version did:

case 2: add(s); break;

Where "add" was a macro:

#define tos(S) SS[-1&31]
#define nos(S) S(S[-1-1)&31]
#define pops(S) S(S[-1--)&31]
#define binop(S,O)  nos(S) = nos(S) O tos(S); pops(S);
#define add(S) binop(S,+)

In the case of the C implementation a 34 cell array was allocated on the stack using alloca, and the pointer was set to the third element in the array, so that the offset & 31 would index the remaining 32 cells as a circular stack. A similar technique was applied to the ArrayBuffer implementation allocating 2 cells past the end of the stack as a stack index and a temporary buffer.

This design produced a human readable code base, but it definitely traded performance for human readability. Just how much so?

Well invoking the compiler as;

Compiler.eval('1600000 push 0 ~ : loop 1 + <- loop')

Which evaluates ~4.8 million VM instructions, produced:

Safari. 1s 4.8mips
Chrome. 0.5s. 8mips
Firefox. 10s. 0.5mips

But a quick rewrite later restructuring the memory access to use straight proxies without the memory object property lookup, removing the function call overhead of the stack accessor functions, and directly coding the assignments inline in the instructions produced:

Safari. 0.495s 9.7mips. (2x improvement)
Chrome. 0.085s. 56mips. (7x improvement)
Firefox. 3.6s. 1.5mips. (3x improvement)

These numbers are nothing short of astonishing. Safari doubles its performance when you remove the function call overhead. Firefox benefitted from. 3x improvement from removing the function call overhead. But Chrome! Chrome demonstrates what happens when the JIT can finally kick in. Not only did the function call overhead go way, but we obviously hit a code path that the JIT liked. In fact, Chrome's performance is not only approaching the performance of native code, but approaching what the FPGA implementation would achieve. As such the performance is at a level suitable for prototyping actual applications, although at almost 100x the power draw.

But why bother doing this at all? Why prototype a VM in JavaScript? Well, for one thing, we almost always have access to a browser, so experimenting is relatively inexpensive. We can always reset the simulation by CMD-R, reloading the page. Additionally, we have virtualized hardware support through the DOM API. With a little bit of code, we can emulate memory mapped I/O for the display, PCM audio, keyboard, mouse, touch pad, and even network via WebSockets and a socket proxy. The cFirth implementation can do these things through SDL, but it requires more effort to reload and test ideas. Also, it is much easier to debug the JavaScript based version.

But there is a very interesting side benefit of experimenting this way, JavaScript optimization techniques emerge that would go otherwise ignored. Having seen how efficient the JIT can be when fed code that it is capable of optimizing, it becomes evident that a JavaScript preprocessor is necessary. It is not enough to write JavaScript so that a human can read it, but also the optimizer too. As such, having a tool which can unravel human friendly code and produce optimizer friendly code becomes a huge win.

A first obvious step is to write a macro preprocessor for JavaScript to rewrite the JS source and recompile. Since we can have undefined references in a script, we can treat each macro as an undefined function initially. When the macro expansion occurs each of these function calls are replaced. Our macro expansion engine may just be a term rewriting engine, or a complex stream processing engine. With macro expansions, the core of the switch can be made more readable, but at the same time preserve performance.

Also inspired by this work, I did some more analysis on how different techniques implement JavaScript performance. Some of my preliminary findings hint at some rather surprising optimization techniques. The variations between browsers also makes micro-optimizations difficult. I've unearthed some rules of thumb which indicate a term rewriting engine could in some cases produce speed improvements of 50% to 1000% with no loss of readability, while actually producing safer, more fault tolerant code. More on that layer this week.