A Design For A Simple Streaming Machine

In my spare time, I have returned to working on a new dual stack machine architecture. The design revolves around the idea that serial memory accesses are a core component of most real-world computations. Picture a race track of circular memory, wherein multiple cores all chase each other around the track. The first process (the pace car) lays down a trail of data and is followed by subsequent processing units, each of which performs a small task on a subset of the data, carrying data from iteration to iteration on its stack. As long as each process requires a uniform amount of processing time on each piece of data, the lag from the first data acquired to the first result is:

time per job * number of jobs
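
A minimal Python sketch of the race track, under some loud assumptions: every job takes exactly one tick, each follower trails the core ahead of it by one slot, and the pace car is not counted as a job. The names run_track and running_sum and the three example stages are hypothetical, chosen only for illustration.

```python
def run_track(inputs, stages, track_size=8):
    """Simulate a pace car and its followers on a circular track.

    Each tick the pace car writes one input into the next slot, while
    stage k works on the slot that was written k ticks earlier.
    """
    if track_size <= len(stages):
        raise ValueError("track must be longer than the chain of stages")
    track = [None] * track_size
    results = []  # (tick, value) pairs leaving the last stage
    for tick in range(len(inputs) + len(stages)):
        if tick < len(inputs):  # pace car lays down data
            track[tick % track_size] = inputs[tick]
        for k, stage in enumerate(stages, start=1):  # followers trail behind
            item = tick - k
            if 0 <= item < len(inputs):
                slot = item % track_size
                track[slot] = stage(track[slot])
        done = tick - len(stages)  # item the last stage just finished
        if 0 <= done < len(inputs):
            results.append((tick, track[done % track_size]))
    return results


def running_sum():
    """A stage that carries data from iteration to iteration."""
    stack = [0]  # stands in for the core's own data stack

    def stage(value):
        stack[0] += value
        return stack[0]

    return stage


stages = [lambda x: x * 2, running_sum(), lambda x: x + 1]
results = run_track([1, 2, 3, 4, 5, 6], stages, track_size=4)
print(results)
```

With three one-tick stages, the first result leaves the track at tick 3, which matches time per job * number of jobs for this toy setup. The track only needs to be one slot longer than the chain of stages, because the pace car never catches up with the last follower.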

Assuming we run every process synchronously, we can operate semi-lock free, with only a low energy wait state being enabled to keep each processing unit in step. Clock skew across the chip can result in potential read/write race conditions, but we can account for this with an additional buffer delay:

(time per job + jitter) * number of jobs
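
As a rough worked example of the two formulas, here is a small Python helper. The function name and the figures (8 jobs, 10 ticks each, 2 ticks of skew buffer) are made up for illustration and are not measurements from any real design.

```python
def first_result_lag(time_per_job, number_of_jobs, jitter=0):
    """Lag from the first data acquired to the first result.

    Computes (time per job + jitter) * number of jobs. With jitter=0 this
    reduces to time per job * number of jobs.
    """
    if time_per_job < 0 or jitter < 0 or number_of_jobs < 0:
        raise ValueError("times and job count must be non-negative")
    return (time_per_job + jitter) * number_of_jobs


# Hypothetical figures: 8 jobs, 10 ticks each, 2 ticks of skew buffer.
ideal = first_result_lag(10, 8)
buffered = first_result_lag(10, 8, jitter=2)
print(ideal, buffered)
```

The buffer delay is paid once per job, so its total cost grows with the length of the chain.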

As long as the cost of the jitter delay is less than the cost of a write lock (say, asserting a signal until the write completes), this allows us to avoid shared state on the write lines. In fact, once the number of cores exceeds a trivial amount, the write assertion lock overhead also introduces its own jitter and timing correction. We can limit which cores access which data in sequence by restricting cores to be geographically adjacent (so they can have a direct wire connection between them), but this makes it harder to allocate cores dynamically and requires additional attention to the routing layout.

That said, we probably want strips of cores which merely operate on data passed core to core via a direct link. Rather than wake on a clock count, these would wake on write assert. Rather than share memory, two or more cores would share a common bus in a source -> sink relationship. When the source core asserts a write, the sink cores would read the message and perform the relevant operations. These operations could be anything the core, or a program running on the sink cores, can do. The sink cores could then serve as a source for other sinks, or even for the original source.

For example, let's suppose our source core is processing incoming requests. It asserts that a new request has arrived on its bus, and the three attached sinks wake: response, auth, and logging. The source then switches to sink mode, waiting for two replies: an ack/nack from auth and an ack/nack from response. As soon as the source core receives two acks, it responds to the requestor. Should either auth or response return a nack, the original source would drop the connection and return to source mode (listening for I/O).
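
The request flow above can be sketched in Python as follows. This is only a software model: the Bus class, the sink functions, and the token and path rules are all hypothetical, and the sinks are called one after another here where real cores would wake in parallel.

```python
ACK, NACK = "ack", "nack"


class Bus:
    """A shared bus: one source asserts a write, every attached sink wakes."""

    def __init__(self, sinks):
        self.sinks = sinks  # name -> callable taking the message

    def assert_write(self, message):
        replies = {}
        for name, sink in self.sinks.items():
            reply = sink(message)  # in silicon these would run in parallel
            if reply is not None:  # silent sinks (logging) send nothing back
                replies[name] = reply
        return replies


def auth_sink(request):
    # Hypothetical rule: only a known token is authorised.
    return ACK if request.get("token") == "secret" else NACK


def response_sink(request):
    # Hypothetical rule: a response exists only for known paths.
    return ACK if request.get("path") in {"/", "/status"} else NACK


log_entries = []


def log_sink(request):
    log_entries.append(dict(request))  # spool the request, never reply


def source(bus, request):
    """Assert the request, then act as a sink until both replies are in."""
    replies = bus.assert_write(request)
    if replies.get("auth") == ACK and replies.get("response") == ACK:
        return "respond"
    return "drop"  # any nack drops the connection


bus = Bus({"response": response_sink, "auth": auth_sink, "logging": log_sink})
print(source(bus, {"token": "secret", "path": "/"}))
print(source(bus, {"token": "wrong", "path": "/"}))
```

The logging sink sees every request, including the ones that end up dropped, because it wakes on the same write assert as the other two sinks.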

As the response and the auth cores access different data sets, they can operate purely in parallel. These cores may do no work on their own, but may delegate to other cores, serving only a command and control or marshaling function. The third core, logging, may merely be spooling data off to long term storage and noting the values of the request. It may also listen as a sink to other cores, logging any ack/nack sent to the controllers. Its exact role is dependent upon the topology of the network connecting the cores.

This architecture is an application of techniques used in actor model software, applied to physical silicon. In many of the real time stream processing systems I have been designing of late, the components of the program have been decomposed to operate in parallel or serially depending upon the scaling properties of the individual components. For example, if a process takes a long time to complete, such as a machine learning algorithm, but is only dependent upon the input (i.e. does not require hysteresis), then farming out the job in parallel requires a core to perform as a multiplexer, asserting a signal for each payload to distribute it to a single sink among many.
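
A small Python sketch of the multiplexer role. The round-robin policy is an assumption of this sketch (the text only says each payload goes to a single sink among many), and the Multiplexer and make_worker names are hypothetical. The worker computes a mean purely as a stand-in for a slow job that depends only on its input.

```python
from itertools import cycle


class Multiplexer:
    """Hands each payload to exactly one sink, taking sinks in turn."""

    def __init__(self, sinks):
        if not sinks:
            raise ValueError("need at least one sink")
        self._sinks = cycle(sinks)  # assumed policy: round-robin

    def assert_payload(self, payload):
        sink = next(self._sinks)
        return sink(payload)


def make_worker(name):
    def worker(payload):
        # Stand-in for a slow job with no hysteresis: the result depends
        # only on this payload, so any worker can take any payload.
        return name, sum(payload) / len(payload)

    return worker


mux = Multiplexer([make_worker(f"core{i}") for i in range(3)])
payloads = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]]
results = [mux.assert_payload(p) for p in payloads]
print(results)
```

Because no worker keeps history, the fourth payload can wrap back around to core0 without changing any result.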

Should a process require hysteresis over a subset of the data, then it is likely that the topology will require a core to work as a router, partitioning the inbound stream according to a set of rules. These separate streams are then processed by cores dedicated to the partition. In this way the working sets can be reduced, and garbage input filtered out of the history dependent processes.
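
The router case can be sketched in the same style. Everything named here is hypothetical: the Router and MovingAverage classes, the by_sensor rule, and the sensor readings are illustrative only. The moving average stands in for any history dependent process.

```python
from collections import deque


class MovingAverage:
    """A history dependent sink: its output depends on earlier payloads."""

    def __init__(self, window):
        self.history = deque(maxlen=window)  # the partition's working set
        self.outputs = []

    def __call__(self, payload):
        self.history.append(payload["value"])
        self.outputs.append(sum(self.history) / len(self.history))


class Router:
    """Partitions an inbound stream according to a rule."""

    def __init__(self, rule, partitions):
        self.rule = rule              # payload -> partition key, or None
        self.partitions = partitions  # key -> dedicated sink
        self.dropped = 0

    def assert_payload(self, payload):
        sink = self.partitions.get(self.rule(payload))
        if sink is None:  # garbage or unknown partition: filter it out
            self.dropped += 1
            return
        sink(payload)


def by_sensor(payload):
    # Hypothetical rule: non-numeric readings are garbage.
    if not isinstance(payload.get("value"), (int, float)):
        return None
    return payload.get("sensor")


router = Router(by_sensor, {"a": MovingAverage(2), "b": MovingAverage(2)})
stream = [
    {"sensor": "a", "value": 1},
    {"sensor": "b", "value": 10},
    {"sensor": "a", "value": 3},
    {"sensor": "a", "value": "junk"},
    {"sensor": "c", "value": 7},
    {"sensor": "a", "value": 5},
]
for reading in stream:
    router.assert_payload(reading)
print(router.partitions["a"].outputs, router.partitions["b"].outputs, router.dropped)
```

Each partition only ever holds the history for its own key, and the junk reading never reaches a history dependent sink.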

With this in mind, each of the CPU designs I am working on is a zero-operand instruction set machine with a heavy focus on I/O. I am experimenting with using register files as circular stacks of various sizes, and directly attaching block RAM to specific cores to provide for local persistence. The idea here is that individual cores would be represented as actors in the programming model, and some actors can access external data, some can store data, and others can output data. By being able to configure multiple topologies by programming the cores, it should be possible to build the software and plug it together like Lego bricks.
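
To make the circular stack idea concrete, here is a hedged Python model of a register file used as a wrapping stack, driven by a tiny zero-operand interpreter. The opcode names (lit, add, dup, swap) and the choice to silently wrap on overflow and underflow are assumptions of this sketch, not details of the actual cores. The lit instruction takes its value from the next program cell, which is one common way for stack machines to load literals.

```python
class CircularStack:
    """A fixed register file addressed as a stack that wraps around."""

    def __init__(self, size):
        self.regs = [0] * size
        self.top = 0  # index of the current top of stack

    def push(self, value):
        self.top = (self.top + 1) % len(self.regs)
        self.regs[self.top] = value  # overwrites the oldest entry when full

    def pop(self):
        value = self.regs[self.top]
        self.top = (self.top - 1) % len(self.regs)
        return value


def run(program, stack):
    """Execute a program whose operators name no registers or addresses."""
    pc = 0
    while pc < len(program):
        op = program[pc]
        pc += 1
        if op == "lit":  # push the inline literal that follows
            stack.push(program[pc])
            pc += 1
        elif op == "add":
            b, a = stack.pop(), stack.pop()
            stack.push(a + b)
        elif op == "dup":
            a = stack.pop()
            stack.push(a)
            stack.push(a)
        elif op == "swap":
            b, a = stack.pop(), stack.pop()
            stack.push(b)
            stack.push(a)
        else:
            raise ValueError(f"unknown instruction: {op!r}")
    return stack


stack = run(["lit", 2, "lit", 3, "add", "dup", "add"], CircularStack(8))
print(stack.pop())
```

A circular register file never overflows or underflows in this model. It just wraps, so the size of the file decides how much history a core can carry from one iteration to the next.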

I have also begun toying with the idea of designing stripped down cores, which have an even more reduced instruction set and merely perform useful routing functions. While the default design treats every core as MIMO, it may be easier to model them as SISO with SISO, SIMO, MISO, and MIMO interconnects. This tracks the software analogs more directly, but introduces additional synchronization issues should an application wish to change its topology on the fly. This can also mean that many advanced routing techniques would require more resource utilization and not less, as the bus address logic would be less flexible. The flip side is that many common cases, such as fanout and point to point, would be optimized and use significantly less power.

Oddly enough, when it comes to designing these cores, the process of synthesis in Verilog is not the hard part. The Xilinx Spartan-6 I am using to prototype has a fairly hefty synth, compile, deploy cycle. As a result, there is often plenty of time between builds to eyeball the source. I've found I can usually catch my mistakes before the bin file is ready. These designs are only a few hundred lines long. The real problems come in with designing the test software.

The correct test software needs to be loaded into each core, and the topology configuration changes need to be accounted for. Often it is impossible to distinguish between a routing error and a design flaw, and as such I've found it necessary to add debug signals on pins. When you're debugging a software based messaging system, you can stick your probes on the wires between software components. With hardware, on the other hand, the act of putting a probe on the wire often changes the value you are trying to observe. I currently have 8 LEDs I use to track data flow, and I override the clock with a hand toggled switch. But this doesn't help when debugging timing glitches and race conditions.

Should these problems be worked through, I foresee a day in the future when I have a computing fabric with billions of processors. These processors will do little more than objects do in current software designs, but will operate in vast networks processing unending waves of data. Different regions of the fabric will specialize in different skills, while others will remain general purpose, being imprinted with new behaviors and pressed into service on demand. These computers will be so much more massively parallel than the commodity hardware I use today that current wisdom will no longer apply. It will be a fun day.