Programming on the VM

In my last post, I described how I wrote a fairly simple dual stack virtual machine, and got the basic tooling to the point I could run and debug simple programs. One of the goals of this VM was to capture the essence of a largely zero-operand minimal instruction set computer. While it uses two cyclic register files for its stacks, and a handful of registers for addressing, it is the rather simple decode logic that provides the greatest opportunity for stumbling. Since zero is the fetch_instruction operation, a clever programmer can squeeze 14 instructions into a 64 bit cell. 12 full 5 bit instructions, 1 partial 4 bit instruction (top bit is 0) and the final instruction is 0 the fetch_instruction pointer which loads the next instruction through i. In this fashion, the VM resembles a very long word instruction set computer, but does not provide for out of order decode of the instruction. It falls to the compiler to pack in as many instructions as possible, but it also means that the programmer can find it more space efficient to compute values:

0 dup xor (takes 2 slots)
-1 0 not (takes 3 slots)
1 0 not neg (takes 4 slots)
2 1 dup + (takes 6 slots)
4 2 dup * (takes 8 slots)

And while it might seem like 8 slots are a lot to generate a number, the equivalent literal instruction takes between 13 and 25 slots depending on which slot the literal instruction occurs. While it would be possible to more cleverly pack literals, and a future version of the VM might even do so, there's other ways of putting a literal on the stack that use fewer operations, but come at the cost of a memory access. If you are careful to lay out your constants in memory in the order of operation it is possible to use the fetch_plus instruction to serially fetch your constants. Let's say you have a routine which calculates a typical polynomial, say: axxx + bx + c. Now I've unrolled the exponent form to more closely resemble what we'd do. Assuming that we have an array of coeefficients, and the value of x on the stack, such that x is on top:

src push
dup push push
pop dup dup * * fetch_plus
pop fetch_plus
fetch_plus +
pop source_set

This entire routine takes 21 slots, or 105 bits. In two 64bit cells we would have a couple slots left over. If we didn't feel like preserving the source register's value we could cut the number of operations further. The setup to call this would require pushing a literal address for the coefficients onto the stack. This itself would require 2 cells, with some slots left over. The price we pay here mostly is stack juggling necessary to preserve the address pointer and replicating the value of x itself. In this example, we use 3 slots on the return stack and 3 slots on the data stack to calculate the value. This means we can not call this sort of code safely in a function 5+ layers deep, if our return stack is 8 deep. This highlights a set of design principles which constrain our solutions:

One of the things that make variables a little less handy than in other stack machine implementations is that the source register limits the ease with which random reads can be processed:

source push
literal variable_address store_source
pop store_source

This routine takes 3 cells to implement, though there is a lot of free slots in the surrounding two cells. This code makes use of 1 return stack and 1 data stack position, to save and restore the source register, while reading from a variable. It would be the equivalent of "some_var @" in Forth. The simpler form, which does not preserve the source register:

literal variable_address set_source

Doesn't save any space really, but reduces the routine by 4 operations. Funnily enough, as the literal forces a memory read, this fetch requires two reads from memory. That being said, it can be terribly efficient if you need to read a single value multiple times or need to read a sequence of values using fetch_plus. Most data stream processing naturally falls into this later camp, and makes packing many ALU instructions into a single fetch very attractive, as it allows the VM to focus on processing.

So why would these trade offs be made in the face of how much complexity is necessary to comprehend while optimizing code? The answer is simple, because this style of chip can be described by a couple hundred lines of verilog, and run blazingly fast on an FPGA or even faster in an ASIC. By minimizing the fetch logic, we can reduce the number of wires connecting the ALU and the memory bridge. When I synthesize the first verilog edition of the VM, I will make it in 8, 16, 24, 32, and 64 but variants. As only so much RAM will be addressable onboard the chip, most of the address lines will be wasted. The data lines necessary for read/write will be a substantial part of the design. The timing for these operations is heavily dependent upon the number of wires and gates, and the longer the ripple carry for addition, the more cycles spent multiplying or dividing, the fewer cycle available for memory fetches. If you know your CPU can process 4, 8, 12,16,etc instructions per memory cycle (the multiple of how much faster the CPU is than RAM), will determine the sweet spot for cell sizing. 8 bits will always be the fastest to process, but will require a fetch every instruction. 16 bit will allow 3 instructions per fetch, and would let us use the high bit to encode inline positive literals (also freeing up an opcode in the process). 24 bit gives us 4-5 operations per fetch, where as 32 bits gives a solid 6 and a nice encoding for literals once again using the high bit. 64bit is probably totally overkill, but gives us 12-13 instructions to encode.

What this means is if memory is clocked at 600MHz, we could clock a core at 3.6GHz and see full utilization if we could fill every cell. In reality, we would likely only be able to support 1/2 that number. So realistically sustaining 1.2-1.8GHZ is the best we could use. All literals code would effectively reduce performance to 300MHz levels as doula ls the instruction fetches from RAM would occur. Even with RAM on die running at the same frequency, only the 16bit version could see close to theoretical performance.