VM Object and Concrete Reality

Many years ago I wrote a small VM for a programming language, in which the VM itself was the base object, and all objects ultimately derived all of their capabilities from this VM object.  To make it work/safe I implemented a linked list of methods for handling overrides, so that conceptually an object could never modify an extant definition, merely trap new definitions with a modified behavior. 

This is the model used in Forth dictionaries, where in an program may locally mark the dictionary, temporarily create a bunch of definitions, and then forget them. SML/NJ also has a similar override logic, which makes mutating existing code hard. 

The more I think about this design, the more correct it feels. Consider a low level object:

  at: anAddress -> mov rdi, rax; mov rax, rdi*8; ret
  put: aValue  -> mov rdi*8, rax; add rdi,1; ret

That's it, our whole convoluted memory object. You'd use it something like:

  Memory at: 0xcafebabe; put: 0xdeadbeef; put: 0xfeedface

Which would generate the code:

  mov rax, Oxcafebabe
  mov rdi, rax
  mov rax, rdi*8
  mov rax, 0xdeadbeef
  mov rdi*8, rax
  add rdi,1
  mov rax,0xfeedface
  mov rdi*8, rax
  add rdi,1

Where 0xcafebabe is the CELL address (qword) andoo the 0xdeadbeef and 0xfeedface are 64bit values written out to memory at 0xcafebabe

A very simple peephole optimizer could eliminate a significant number of redundant moves generating:

 mov rdi,0xcafebabe
 mov rax, 0xdeadbeef
 mov [rdi
8], rax
 add rdi,1
 mov rax, 0xfeedface
 mov rdi*8, rax
 add rdi,1

The peephole optimizer need only identify two cases:

That will eliminate a single memory access and a bit of infer all preamble. The simple act of not fetching from memory when we don't need to will dramatically improve the quality of the output. 
Now with this object, we can now implement our entire VM, our object system, and the Memory object itself!  I can hand code the machine code to bootstrap this behavior on nearly any CPU, and the main bit is sorting out your register allocation. 

This too can be done with an ALU object, or a GPU object, or even a NIC object. All of the low level machine specific code can be implemented using the memory object, and all we need is a way to send it messages.  

  run: aProgram ->
        mov rbp, rax
  lp:  mov rbx,rbp*8
        add rbp,1
        mov rax,rbp*8
        add rbp,1
        call rbx
        jmp lp

what this does is set up a very simple run loop where you can lay out a series of message sends

    method address argument ...

And it calls to each in sequence.  So your code could look like:

  address of at: 0xcafebabe address of put: 0xdeadbeef address of put: Oxfeedface address of run: address of main system loop

And we can evaluate the original program example as interpreted vector dispatch code (much like an early Forth interpreter using CFA lookups). 

We can further build a compiler which looks up the body of the code fragments in our object and have it inline the fragments which are NOT self-referential or recursive. This is a minor change to the above code in that it COPIES the binary for each method up to RET and inlines the argument passing.  This  new compiled method can be jumped to as any other primitive. 

This now gives us a model of machine memory, which allows us to interpret byte code, and with a little ingenuity also the ability to compile machine code into memory. Let's say we add another transfer of control method:

  CPU call: anAddress with: anObject->
        lea rbx, rax*8
        pop rax;
        jmp rbx

We now have a model that allows us to take a cell address and execute it, with the return address falling back to whereever we happen to invoke CPU call: with:  where the object we are passing as the argument could be a literal value, pointer to a structure, or another method pointer. 

We can extend this model to capture all the ALU operations:

    add: aValue -> add rax,aValue
    sub: aValue -> sub rax, aValue
    mul: aValue -> imul rax, aValue
    div: aValue -> xor rdx,rdx; idiv rax, aValue
    mod: aValue -> 
        xor rdx,rdx; idiv rax,aValue; mov rax,rdx
    and: aValue -> and rax, aValue
    or: aValue -> or rax, aValue
    xor: aValue -> xor rax, aValue
    not -> not rax
    neg -> neg rax

Which now gives us a working VM that can mimic most of the integer math of C with signed values. Only things left are providing a straight forward model for comparing two values and vectoring execution based on those tests. 

    eq: aValue do: aBlock->
        pop rbx
        cmp rax,aValue
        jne _skip
        jmp rbx
      _skip: ret
    lt: aValue do: aBlock->
        pop rbx
        cmp rax, aValue
        jnl _skip
        jmp rbx
      _skip: ret
    gt: aValue do: aBlock ->
        pop rbx
        cmp rax, aValue
        jng _skip
       jmp rbx
      _skip: ret

With this we can model both conditional branching and conditional recursion.  We could use some Basic math to reduce the number of branching constructs but this is rarely necessary in practice. When you think about boolean math, we can do it with eq:  for example we could define

  Boolean ifFalse: aBlock ->
       CPU eq: 0 do: aBlock

The definition of ifTrue: could be:

  Boolean ifTrue: aBlock ->
        CPU not; eq: 0 do: aBlock

Which would mean false is 0 and true is not false.  This hints at a peculiar relationship that is embodied by the false, null, 0, () pun so common in other languages, but could also imply a strict definition for true as the hardware logical compliment of 0, which is architecture defined. This isn't all bad as Boolean objects should only respond to Boolean messages and the programmer ought not to care about the underlying representation. To rely upon that would be a bug. 

From here we can build Block objects which compile method calls into inline machine instructions, Parser / Compiler objects that turn strings of text into method calls, and all the various bits of a functional software stack. 

Probably most interestingly, we can augment our VM objects and extend our system at run time as we would in Forth.  By being able to add support for interesting hardware capabilities, but maintain a consistent core, we can derive additional benefits from our choice of hardware without incurring additional layers of context in our application. 

Now the assembler in these examples is intentionally limited. My choice of register allocation also very constrained. This is not a limitation of the technique, but an attempt to demonstrate how easily it can be applied in a register starved environment like a 8086 machine. Even using the full 16 general purpose registers of the x86-64 architecture pales in comparison to many MIPS based processors.  Clever programmers can already see how polymorphism could be used to segment register allocation, and greater efficiencies gained by using multiple registers for the ABI. I intentionally limited these examples to a top of stack in rax, with additional arguments passed on the stack to highlight how few arguments you need. I also tend to favor math ops that inline their arguments as few examples of code in the wild actually reduce 2 or more variables at once.   Those special cases like sqrt(aa + bb) are almost always best served by more optimal register allocations anyways. 

And that's kinda the point. Once you're not afraid to drop to low level code in your objects, programming fast tight assembler loops becomes fun again. And works beautifully with your highly abstract dynamic dispatch high level models.