VM Object and Concrete Reality
Many years ago I wrote a small VM for a programming language in which the VM itself was the base object, and all objects ultimately derived their capabilities from this VM object. To make it safe, I implemented a linked list of methods for handling overrides, so that conceptually an object could never modify an existing definition, merely shadow it with a new definition carrying the modified behavior.
This is the model used in Forth dictionaries, where a program may locally mark the dictionary, temporarily create a bunch of definitions, and then forget them. SML/NJ has similar override logic, which makes mutating existing code hard.
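The shadow-never-mutate dictionary is easy to sketch. Here is a minimal Python model of the idea (class and method names are my own, not from the original VM): definitions are prepended to a linked list, lookup walks from the newest entry, and mark/forget work exactly like a Forth dictionary pointer.

```python
class MethodDict:
    """A Forth-style dictionary: a linked list of (name, body, older) entries.
    Definitions are never modified; a new entry simply shadows older ones."""
    def __init__(self):
        self.head = None  # most recent definition

    def define(self, name, body):
        self.head = (name, body, self.head)  # prepend, never mutate

    def lookup(self, name):
        entry = self.head
        while entry is not None:
            if entry[0] == name:
                return entry[1]  # newest definition wins
            entry = entry[2]
        raise KeyError(name)

    def mark(self):
        return self.head  # remember the current dictionary head

    def forget(self, marker):
        self.head = marker  # drop everything defined since mark()

d = MethodDict()
d.define("at:", "load")
m = d.mark()
d.define("at:", "traced load")   # shadows, does not modify, the original
assert d.lookup("at:") == "traced load"
d.forget(m)                      # Forth-style: forget back to the mark
assert d.lookup("at:") == "load"
```

Because old entries are never touched, anything compiled against them keeps working even while newer definitions shadow them.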
The more I think about this design, the more correct it feels. Consider a low level object:
at: anAddress -> mov rdi, rax; mov rax, [rdi*8]; ret
put: aValue -> mov [rdi*8], rax; add rdi, 1; ret
That's it, our whole convoluted memory object. You'd use it something like:
Memory at: 0xcafebabe; put: 0xdeadbeef; put: 0xfeedface
Which would generate the code:
mov rax, 0xcafebabe
mov rdi, rax
mov rax, [rdi*8]
mov rax, 0xdeadbeef
mov [rdi*8], rax
add rdi, 1
mov rax, 0xfeedface
mov [rdi*8], rax
add rdi, 1
Where 0xcafebabe is the CELL address (qword) and 0xdeadbeef and 0xfeedface are 64-bit values written out to consecutive cells starting at byte address 0xcafebabe * 8.
A very simple peephole optimizer could eliminate a significant number of redundant moves, generating:
mov rdi, 0xcafebabe
mov rax, 0xdeadbeef
mov [rdi*8], rax
add rdi, 1
mov rax, 0xfeedface
mov [rdi*8], rax
add rdi, 1
The peephole optimizer need only identify two cases:
- a constant loaded into rax only to be moved to another general register (fold into a direct load)
- two sequential mov instructions targeting the same register (the first is dead)
That eliminates a memory access and a bit of needless preamble. The simple act of not fetching from memory when we don't need to dramatically improves the quality of the output.
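The two rules above are mechanical enough to show directly. This is a hedged sketch in Python (instruction encoding and function name are my own choices) that walks a list of (op, dst, src) triples and applies both rules in one pass:

```python
def peephole(code):
    """One pass of the two rules over (op, dst, src) triples.
    Rule 1: a constant loaded into rax and immediately moved to another
            register folds into a direct load of that register.
    Rule 2: of two consecutive movs into the same register, the first is dead."""
    out = []
    i = 0
    while i < len(code):
        cur = code[i]
        nxt = code[i + 1] if i + 1 < len(code) else None
        # Rule 1: mov rax, <imm> ; mov reg, rax  ->  mov reg, <imm>
        if (nxt and cur[0] == "mov" and cur[1] == "rax"
                and isinstance(cur[2], int)
                and nxt[0] == "mov" and nxt[2] == "rax"
                and not nxt[1].startswith("[")):        # never fold into a store
            out.append(("mov", nxt[1], cur[2]))
            i += 2
            continue
        # Rule 2: two sequential movs writing the same register: first is dead
        if (nxt and cur[0] == "mov" and nxt[0] == "mov"
                and cur[1] == nxt[1]
                and not cur[1].startswith("[")          # stores have side effects
                and str(cur[1]) not in str(nxt[2])):    # next mov must not read it
            i += 1  # drop the dead load
            continue
        out.append(cur)
        i += 1
    return out

# A fragment of the generated sequence from the Memory example:
code = [
    ("mov", "rax", 0xCAFEBABE),
    ("mov", "rdi", "rax"),
    ("mov", "rax", "[rdi*8]"),   # dead: overwritten by the next load
    ("mov", "rax", 0xDEADBEEF),
    ("mov", "[rdi*8]", "rax"),
]
optimized = peephole(code)
assert optimized == [("mov", "rdi", 0xCAFEBABE),
                     ("mov", "rax", 0xDEADBEEF),
                     ("mov", "[rdi*8]", "rax")]
```

Five instructions become three: the constant/register shuffle folds into one load, and the fetch from memory disappears because nothing reads it.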
With this object, we can now implement our entire VM, our object system, and the Memory object itself! I can hand-code the machine code to bootstrap this behavior on nearly any CPU; the main bit is sorting out your register allocation.
This too can be done with an ALU object, or a GPU object, or even a NIC object. All of the low level machine specific code can be implemented using the memory object, and all we need is a way to send it messages.
run: aProgram ->
mov rbp, rax
lp: mov rbx, [rbp*8]
mov rax, [rbp*8+8]
add rbp, 2
call rbx
jmp lp
What this does is set up a very simple run loop where you can lay out a series of message sends:
method address argument ...
And it calls each in sequence. So your code could look like:
address of at:, 0xcafebabe, address of put:, 0xdeadbeef, address of put:, 0xfeedface, address of run:, address of main system loop
And we can evaluate the original program example as interpreted vector dispatch code (much like an early Forth interpreter using CFA lookups).
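The inner interpreter is small enough to model whole. Here is a Python sketch of the run loop (the names `at`, `put`, and `run` stand in for the machine-code methods; the flat list plays the role of the method/argument cells):

```python
# Hypothetical Python model of the run loop: a program is a flat list of
# alternating method-address and argument cells, and run: calls each in turn.
memory = {}
cursor = 0

def at(addr):            # at: -- select a cell address
    global cursor
    cursor = addr

def put(value):          # put: -- store a value and advance to the next cell
    global cursor
    memory[cursor] = value
    cursor += 1

def run(program):
    """Vector dispatch, much like a Forth inner interpreter:
    fetch a method address, fetch its argument, invoke, repeat."""
    ip = 0
    while ip < len(program):
        method, arg = program[ip], program[ip + 1]
        method(arg)
        ip += 2

run([at, 0xCAFEBABE, put, 0xDEADBEEF, put, 0xFEEDFACE])
assert memory[0xCAFEBABE] == 0xDEADBEEF
assert memory[0xCAFEBABE + 1] == 0xFEEDFACE
```

The interpreter knows nothing about the methods it dispatches; it just marches down the cells, which is exactly what makes the same layout amenable to compilation later.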
We can further build a compiler which looks up the body of the code fragments in our object and have it inline the fragments which are NOT self-referential or recursive. This is a minor change to the above code in that it COPIES the binary for each method up to RET and inlines the argument passing. This new compiled method can be jumped to as any other primitive.
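The copy-up-to-RET inlining step can be sketched on the same Python model. Method bodies here are lists of instruction triples, and `compile_inline` (a name I've invented for illustration) copies each body up to its `ret` with the argument load placed in front, just as the interpreter would have done:

```python
# Each method body is a list of primitive instructions ending in "ret".
methods = {
    "at:":  [("mov", "rdi", "rax"), ("mov", "rax", "[rdi*8]"), ("ret",)],
    "put:": [("mov", "[rdi*8]", "rax"), ("add", "rdi", 1), ("ret",)],
}

def compile_inline(program):
    """program is a list of (selector, argument) sends. Every body is COPIED
    up to (but not including) its RET, with the argument pass inlined ahead
    of it, producing one flat fragment with no dispatch left in it."""
    out = []
    for selector, arg in program:
        out.append(("mov", "rax", arg))          # inline the argument pass
        for ins in methods[selector]:
            if ins == ("ret",):
                break                            # copy the body up to RET
            out.append(ins)
    return out

code = compile_inline([("at:", 0xCAFEBABE), ("put:", 0xDEADBEEF)])
assert ("ret",) not in code                      # fully flattened
assert code[0] == ("mov", "rax", 0xCAFEBABE)
```

A real implementation would have to refuse to inline self-referential or recursive methods, as the text says; those keep their call sites.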
This now gives us a model of machine memory, which allows us to interpret byte code, and with a little ingenuity also the ability to compile machine code into memory. Let's say we add another transfer of control method:
CPU call: anAddress with: anObject ->
lea rbx, [rax*8]
mov rax, anObject
call rbx
We now have a model that allows us to take a cell address and execute it, with the return address falling back to wherever we happen to invoke CPU call: with:, where the object we pass as the argument could be a literal value, a pointer to a structure, or another method pointer.
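On the Python memory model, that transfer of control is one fetch and one invoke. This sketch (the `call_with` name is mine) stores a "compiled" method in a cell and dispatches through it:

```python
# A cell holds a method (here, a Python callable standing in for machine
# code); call:with: resolves the cell and invokes it with an object, and
# control falls back to the caller afterwards.
memory = {}

def call_with(address, an_object):
    method = memory[address]       # lea rbx, [rax*8] : resolve the cell
    return method(an_object)       # call rbx, with the argument in rax

memory[0x100] = lambda x: x + 1    # a compiled method stored in a cell
assert call_with(0x100, 41) == 42  # execution returns here after the call
```

Because the cell can hold anything callable, primitives, inlined fragments, and user-compiled methods are all invoked through the same door.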
We can extend this model to capture all the ALU operations:
add: aValue -> add rax,aValue
sub: aValue -> sub rax, aValue
mul: aValue -> imul rax, aValue
div: aValue -> mov rbx, aValue; cqo; idiv rbx
mod: aValue -> mov rbx, aValue; cqo; idiv rbx; mov rax, rdx
and: aValue -> and rax, aValue
or: aValue -> or rax, aValue
xor: aValue -> xor rax, aValue
not -> not rax
neg -> neg rax
Which now gives us a working VM that can mimic most of C's signed integer math. The only things left are a straightforward model for comparing two values and vectoring execution based on those tests.
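The signed 64-bit semantics are worth pinning down, because they are where a naive model goes wrong. A minimal sketch, assuming the top of stack lives in rax (class and helper names are hypothetical): x86 idiv truncates toward zero while Python's // floors, so div: and mod: are spelled out rather than written as `self.rax //= v`.

```python
MASK = (1 << 64) - 1

def signed(x):
    """Wrap to a signed 64-bit value, as the register does."""
    x &= MASK
    return x - (1 << 64) if x >> 63 else x

class ALU:
    def __init__(self, value=0):
        self.rax = value                          # top of stack
    def add(self, v): self.rax = signed(self.rax + v); return self
    def sub(self, v): self.rax = signed(self.rax - v); return self
    def mul(self, v): self.rax = signed(self.rax * v); return self
    def div(self, v):
        a, q = self.rax, abs(self.rax) // abs(v)
        self.rax = -q if (a < 0) != (v < 0) else q   # truncate toward zero, like idiv
        return self
    def mod(self, v):
        a = self.rax
        self.div(v)
        self.rax = a - self.rax * v               # remainder takes the dividend's sign
        return self

assert ALU(-7).div(2).rax == -3          # C/idiv semantics; Python's -7 // 2 is -4
assert ALU(-7).mod(2).rax == -1          # C/idiv semantics; Python's -7 % 2 is 1
assert ALU((1 << 63) - 1).add(1).rax == -(1 << 63)   # signed wraparound
```

The method-chaining style mirrors the cascade of message sends in the examples above: each op consumes rax and leaves its result there.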
eq: aValue do: aBlock -> cmp rax, aValue; jne .skip; call aBlock; .skip: ret
lt: aValue do: aBlock -> cmp rax, aValue; jge .skip; call aBlock; .skip: ret
gt: aValue do: aBlock -> cmp rax, aValue; jle .skip; call aBlock; .skip: ret
With this we can model both conditional branching and conditional recursion. We could use some basic math to reduce the number of branching constructs, but this is rarely necessary in practice. When you think about Boolean math, we can do it with eq:. For example, we could define:
Boolean ifFalse: aBlock ->
CPU eq: 0 do: aBlock
The definition of ifTrue: could be:
Boolean ifTrue: aBlock ->
CPU not; eq: 0 do: aBlock
Which would mean false is 0 and true is not false. This hints at a peculiar relationship embodied by the false/null/0/() pun so common in other languages, but it could also imply a strict definition of true as the hardware logical complement of 0, which is architecture defined. This isn't all bad, as Boolean objects should only respond to Boolean messages, and the programmer ought not to care about the underlying representation. To rely upon it would be a bug.
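The false = 0, true = ~0 scheme can be checked in miniature. A Python sketch of the two definitions above (all names hypothetical): ifFalse: is eq: 0 do:, and ifTrue: complements rax first so that only all-ones reaches the block.

```python
MASK = (1 << 64) - 1
FALSE, TRUE = 0, MASK          # true is ~0: the register-width complement of 0

class CPU:
    def __init__(self, rax=0):
        self.rax = rax
    def eq_do(self, value, block):
        if self.rax == value:  # cmp rax, value; conditionally call the block
            block()
        return self
    def not_(self):
        self.rax = ~self.rax & MASK
        return self

def if_false(cpu, block):
    cpu.eq_do(FALSE, block)            # Boolean ifFalse: -> CPU eq: 0 do: aBlock

def if_true(cpu, block):
    cpu.not_().eq_do(FALSE, block)     # Boolean ifTrue: -> CPU not; eq: 0 do: aBlock

ran = []
if_true(CPU(TRUE), lambda: ran.append("t"))
if_false(CPU(FALSE), lambda: ran.append("f"))
if_true(CPU(FALSE), lambda: ran.append("bug"))   # must not fire
assert ran == ["t", "f"]
```

Note the sketch deliberately does nothing sensible for values that are neither 0 nor ~0; as the text says, relying on the representation of a Boolean is a bug, not a feature.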
From here we can build Block objects which compile method calls into inline machine instructions, Parser / Compiler objects that turn strings of text into method calls, and all the various bits of a functional software stack.
Probably most interestingly, we can augment our VM objects and extend our system at run time, as we would in Forth. By adding support for interesting hardware capabilities while maintaining a consistent core, we can derive additional benefits from our choice of hardware without incurring additional layers of context in our application.
Now, the assembler in these examples is intentionally limited. My choice of register allocation is also very constrained. This is not a limitation of the technique, but an attempt to demonstrate how easily it can be applied in a register-starved environment like an 8086 machine. Even using the full 16 general-purpose registers of the x86-64 architecture pales in comparison to many MIPS-based processors. Clever programmers can already see how polymorphism could be used to segment register allocation, and greater efficiencies gained by using multiple registers for the ABI. I intentionally limited these examples to a top of stack in rax, with additional arguments passed on the stack, to highlight how few arguments you need. I also tend to favor math ops that inline their arguments, as few examples of code in the wild actually reduce 2 or more variables at once. Those special cases, like sqrt(a*a + b*b), are almost always best served by more optimal register allocation anyway.
And that's kinda the point. Once you're not afraid to drop to low-level code in your objects, programming fast, tight assembler loops becomes fun again. And it works beautifully with your highly abstract, dynamic-dispatch high-level models.