Jawas Revisited

This past week, between sessions at Embedded World, I spent some time working on improvements to Jawas. The project has lain dormant for quite a while now. I've only made a few modifications since 2009, mostly to adapt to changes in platform support. This time I was aiming to fix some long-standing issues with zombie processes.

Around the time I resumed Linux support, I introduced a signal handling bug. Jawas is a full forking server, with each connection receiving its own process. While Jawas uses event-driven I/O like Node.js (though it predates Node), I found strict process isolation necessary to guarantee availability. It doesn't matter how many hundreds of requests per second you can support if a single buggy request can kill hundreds of others. This becomes incredibly important for the behavior of a server under stress.

Some basic testing of static and dynamic pages on my MacBook Pro showed flat response for 1 to 120 concurrent connections, servicing between 800 and 1000 requests per second. The only reason I maxed out at 120 concurrent connections was that I ran into the hard-coded max user process limit in the Mac OS X kernel. If I were running the server variant I could override this setting, but Apple in their infinite wisdom cripples the desktop.

When I extended the tests to run in the tens of thousands of requests range, I began to see some interesting behavior. The server would oscillate between the prior max and exactly 1/3 of that amount. The CPU load of the Jawas process had not changed, but the response times tripled. Meanwhile, a new process was at the top of the Activity Monitor each time the performance dipped: syslogd! Yes, syslogd was failing to keep up with the volume of logging generated, and was then pegging the CPU for 10+ seconds to catch up. Not only was Jawas performing well under load, it was crushing a core subsystem that couldn't handle the sustained load.

I'm going to try to replicate this behavior on Linux, but tracing the app under OS X put the blame squarely on syslogd for the performance instability. That said, most requests which hit the database crush Postgres long before this is an issue. In practice, a database round trip adds an order of magnitude or two more latency to the request, reducing throughput into the 10s to 100s of requests per second range.

All of this made me confident that Jawas could still be relevant in the current world order. It has a number of features which make it superior to the hip new frameworks. So I started thinking about how I would "modernize" this 10-year-old application. There is some obvious low-hanging fruit:

The motivation for most of these changes came after implementing most of RFC 6455 in a day, when I realized I would need to make the request handlers pluggable to support the new full-duplex flow. That got me thinking about how one would make use of a WebSocket path. Typically a WebSocket URI would terminate at some script which would parse and handle the messages, which would require a WebSocket MIME type and handler. This is awkward, as one could conceivably want to handle messages via JavaScript or Lua or even a C module. There's nothing in the HTTP request that makes how to dispatch this obvious either. Some WebSocket clients can specify a subprotocol via Sec-WebSocket-Protocol, but not all RFC 6455 implementations in the wild support this header. Also, no major browser allows you to add your own headers to WebSocket requests. As such, I need to change the current dispatch logic to break the relationship between path artifact and MIME type dispatch.
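One way to picture dispatch that no longer depends on MIME type is a route table keyed by path prefix. The sketch below is an assumption about how that could look, not the Jawas dispatch code: handler_kind, route, and find_route are hypothetical names, and longest-prefix matching is simply one reasonable policy.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical handler kinds; the names are illustrative only. */
typedef enum {
    HANDLER_STATIC,
    HANDLER_JAVASCRIPT,
    HANDLER_LUA,
    HANDLER_C_MODULE
} handler_kind;

typedef struct {
    const char  *prefix;   /* URL path prefix this route owns */
    handler_kind kind;     /* receiver for requests and WebSocket messages */
} route;

/* Return the route with the longest matching prefix, or NULL if none match.
 * A prefix only matches on a path segment boundary, so "/chat" owns
 * "/chat/room1" but not "/chatter.js". The file extension is never consulted. */
static const route *find_route(const route *routes, size_t count, const char *path)
{
    const route *best = NULL;
    size_t best_len = 0;

    for (size_t i = 0; i < count; i++) {
        size_t len = strlen(routes[i].prefix);
        if (strncmp(path, routes[i].prefix, len) != 0)
            continue;

        int on_boundary = (len > 0 && routes[i].prefix[len - 1] == '/')
                       || path[len] == '\0'
                       || path[len] == '/';
        if (!on_boundary)
            continue;

        if (best == NULL || len > best_len) {
            best = &routes[i];
            best_len = len;
        }
    }
    return best;
}
```

With a table such as {"/", static}, {"/chat", Lua}, {"/chat/admin", C module}, a WebSocket upgrade on /chat/room1 lands on the Lua handler without any MIME type or extra header being involved.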

If I'm going down that route, I might as well take the opportunity to separate responsibilities more. Jawas has always supported using a path to specify the continuation of an asynchronous operation. Basically this allowed me to avoid resuming an old VM session when a call out to a 3rd party service took a long time. I could save state in the DB and resume in a new process with a fresh VM. If the 3rd party API involved a callback, the code looked exactly the same. The HTTP callback would invoke the related script, and a new instance of the JavaScript or Lua interpreter would perform the desired response.

Originally, Jawas didn't have stream processing because I used the ConnServer as a live event engine. Bots written in Python, Perl, Erlang, OCaml, or Lisp would communicate with each other via the ConnServer socket interface. URL-encoded messages (this server is pre-JSON and supported Flash XMLSocket connections and Flash's URL object decode functions) would interact with objects modeled by C++ objects and backed by database blobs. With this in mind, Jawas really only existed to deliver some basic webpages, handle login, handle payment processing, and serve up the Flash games. There wasn't even a need to serve game data from the Jawas database, as ConnServer and Jawas would use separate databases and schemas.

Now that ConnServer is 12 years old, and Jawas 10, it might make more sense to adopt the routing and messaging features of ConnServer in the Jawas model. This way the ConnServer notion of "rooms" could be married with Jawas's paths, and a WebSocket connection binds you to a message queue with a fanout exchange. ConnServer allowed for routing to specific recipients within a room as well, and we could use the concept of topic routing to model that too. Scripts can be modeled the same way, as paths that are invoked by sending them a message. An HTTP request is just a message after all! And if we consider server-side includes as message sends to URLs, we can replace all of the inline scripts with script URIs. This opens up a number of possible language implementations, remote services, and realtime proxy behavior.
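As a rough illustration of a room bound to a path, the sketch below delivers a message either to every subscriber (fanout) or to one named recipient. Everything here is hypothetical: the room and subscriber types, room_join, room_send, and the fixed-size subscriber array are assumptions made to keep the example small, not ConnServer or Jawas structures.

```c
#include <stddef.h>
#include <string.h>

#define MAX_SUBSCRIBERS 8

/* Delivery callback: receives the subscriber's own context and the message. */
typedef void (*deliver_fn)(void *ctx, const char *message);

typedef struct {
    const char *name;      /* recipient name, used for directed sends */
    deliver_fn  deliver;   /* e.g. write a WebSocket frame or run a script */
    void       *ctx;
} subscriber;

/* A "room" is just a path with a set of attached subscribers. */
typedef struct {
    const char *path;
    subscriber  subs[MAX_SUBSCRIBERS];
    size_t      count;
} room;

/* Attach a subscriber to the room. Returns 0 on success, -1 if the room is full. */
static int room_join(room *r, const char *name, deliver_fn fn, void *ctx)
{
    if (r->count >= MAX_SUBSCRIBERS)
        return -1;
    r->subs[r->count++] = (subscriber){ name, fn, ctx };
    return 0;
}

/* Send a message to the room. With recipient == NULL every subscriber gets it
 * (fanout); otherwise only subscribers with a matching name do.
 * Returns the number of deliveries made. */
static size_t room_send(const room *r, const char *recipient, const char *message)
{
    size_t delivered = 0;
    for (size_t i = 0; i < r->count; i++) {
        if (recipient != NULL && strcmp(r->subs[i].name, recipient) != 0)
            continue;
        r->subs[i].deliver(r->subs[i].ctx, message);
        delivered++;
    }
    return delivered;
}

/* Example subscriber used for demonstration: counts the messages it receives. */
static void count_delivery(void *ctx, const char *message)
{
    (void)message;
    ++*(int *)ctx;
}
```

Because a subscriber is only a callback plus a context, the same room could hold a WebSocket connection, a script invocation, and a proxy to a remote service side by side.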

Jawas has had basic proxy behavior baked in for years. In many early applications I built tiers of Jawas, where a frontend Jawas would request a value from a backend Jawas via an HTTP request. Many times these requests would also integrate 3rd party APIs like Facebook or Google Checkout, and so the isolation of the backend processing from the frontend server was also desirable. This concept of isolation of concerns has been core to the design of most of my systems. The principle is that the failure of a given subsystem exposes only a limited attack surface for any potential security breach. While I can't guarantee any given process won't be compromised, I can promise that the compromised system will be limited to only that data the attacker would normally have access to. As the data is either part of the request or part of the response, the attacker will by definition know the former and by nature of a response be delivered the latter. For example, knowing that the purchase callback failed is true of both an attack and a legitimate failure. But as the failing system only reveals the state it chooses to send, there is no opportunity to sniff that data from the originating process's memory space, as it simply doesn't have access.

My thinking on system-level access is leading me down the path of removing the file system too. Right now the parent process mmaps all of the subfolders of the current working directory, and exposes those files as URLs. The server by design will not attempt to read new files unless sent a SIGHUP, but will reflect changes to any file it already has open, via the VNODE event on the file. This was done so that new directories and files could be deployed but not released until ready. For instance, let's say our web app is running out of ./jawas.ws/v1/ and we want to release a v2. We could deploy all of our files to ./jawas.ws/v2/ and even run a Jawas on another port with these files mapped, and then choose to release to the public Jawas by sending the original a SIGHUP. The original listener would then rescan the ./jawas.ws/ directory and find the new version. This works fine as long as you can replicate file system state across all of your nodes.

As an alternative to rsync, I am thinking of borrowing ConnServer's object cache concept. Each asset in the environment would be database backed and have an explicit TTL. New versions could be deployed, and multiple versions could be deployed simultaneously for A/B testing. Any Jawas pointed at the same source database would cache the values on demand, and the database itself could notify every attached Jawas of a component change via triggers. This also allows for tooling to update content in the DB via Jawas's HTTP and WebSocket interfaces. We trade a simple file system for a DBMS, but the tooling becomes more reliable at scale. Lastly, it makes containerization of Jawas easy, as the only configuration value becomes which database to attach to for content. It would be easy to supply this to Docker as an environment variable.
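A small sketch of what an on-demand cache with explicit TTLs might look like is given below. None of it comes from ConnServer or Jawas: the asset_cache type, cache_put, cache_get, the fixed slot count, and the JAWAS_DATABASE_URL variable name are all assumptions chosen for illustration. The current time is passed in explicitly so expiry is easy to reason about.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define CACHE_SLOTS 16

typedef struct {
    const char *path;      /* asset URL path (borrowed pointer, not copied) */
    const char *body;      /* cached content (borrowed pointer) */
    time_t      expires;   /* absolute time at which the entry goes stale */
} cache_entry;

typedef struct {
    cache_entry slots[CACHE_SLOTS];
    size_t      count;
} asset_cache;

/* Store an asset, replacing any existing entry for the same path.
 * Returns 0 on success, -1 if the cache is full. */
static int cache_put(asset_cache *c, const char *path, const char *body,
                     time_t now, time_t ttl)
{
    for (size_t i = 0; i < c->count; i++) {
        if (strcmp(c->slots[i].path, path) == 0) {
            c->slots[i].body = body;
            c->slots[i].expires = now + ttl;
            return 0;
        }
    }
    if (c->count >= CACHE_SLOTS)
        return -1;
    c->slots[c->count++] = (cache_entry){ path, body, now + ttl };
    return 0;
}

/* Return the cached body if present and still fresh, otherwise NULL.
 * On NULL the caller would fetch the asset from the database again. */
static const char *cache_get(const asset_cache *c, const char *path, time_t now)
{
    for (size_t i = 0; i < c->count; i++) {
        if (strcmp(c->slots[i].path, path) == 0)
            return now < c->slots[i].expires ? c->slots[i].body : NULL;
    }
    return NULL;
}

/* Hypothetical configuration lookup: the content database comes from the
 * JAWAS_DATABASE_URL environment variable (an assumed name), with a
 * caller-supplied fallback when it is unset or empty. */
static const char *content_database(const char *fallback)
{
    const char *url = getenv("JAWAS_DATABASE_URL");
    return (url != NULL && url[0] != '\0') ? url : fallback;
}
```

A database trigger notification would map onto this as a cache_put with the new body, or as an early expiry of the affected path, so every attached server converges without any file copying.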

With these changes in place, adding support for new protocols becomes much easier. As each request or message send gets translated to a Jawas API path and message send, new protocols need only be mapped into the URL space. Since multiple producers and consumers can attach to a given URI, we can have all HTTP traffic mirrored to MQTT or AMQP. Similarly, as an HTTP response becomes no different from a single-shot WebSocket message or CoAP response, any script would naturally support all available protocols. With the database bindings, one could even stream DB requests to a given stored procedure, allowing for new workflows. HTTP long polling or HTTP/2 multiplexing could also be directly supported.

By breaking the Jawas client handling code into an extensible event system, we also open the door to future unforeseen workflows. Rather than simple IN -> OUT and OUT -> IN, Jawas could model entire state machines with simple event extensions. User-defined events will generally fall into one of two cases: a message coming in, or waiting for a message to go out. Timer events can also be directly supported, which will remove one of the extra signaling cases currently used by the system. Periodic events, which are currently based on timeouts, could be made first-class timer events, simplifying the event model and error handling.
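To show how message and timer events could drive a state machine, here is a minimal table-driven sketch. The states, events, and transition choices are assumptions for illustration only (for example, treating a timer firing as a timeout that closes the connection); they are not the Jawas event model.

```c
#include <stddef.h>

/* Hypothetical event kinds: a message arrived, a message was sent,
 * or a timer fired. */
typedef enum { EV_MESSAGE_IN, EV_MESSAGE_OUT, EV_TIMER, EV_COUNT } event_kind;

/* Hypothetical connection states. */
typedef enum { ST_AWAIT_IN, ST_AWAIT_OUT, ST_CLOSED, ST_COUNT } conn_state;

/* Next state for each (state, event) pair. The classic IN -> OUT -> IN cycle
 * is the first two columns; the timer column treats a firing timer as a
 * timeout that closes the connection. */
static const conn_state transitions[ST_COUNT][EV_COUNT] = {
    [ST_AWAIT_IN] = {
        [EV_MESSAGE_IN]  = ST_AWAIT_OUT,   /* request received, owe a reply */
        [EV_MESSAGE_OUT] = ST_AWAIT_IN,    /* unsolicited push, keep waiting */
        [EV_TIMER]       = ST_CLOSED,
    },
    [ST_AWAIT_OUT] = {
        [EV_MESSAGE_IN]  = ST_AWAIT_OUT,   /* more input while reply pending */
        [EV_MESSAGE_OUT] = ST_AWAIT_IN,    /* reply sent, wait for the next */
        [EV_TIMER]       = ST_CLOSED,
    },
    [ST_CLOSED] = {
        [EV_MESSAGE_IN]  = ST_CLOSED,
        [EV_MESSAGE_OUT] = ST_CLOSED,
        [EV_TIMER]       = ST_CLOSED,
    },
};

/* Apply one event. Out-of-range inputs close the connection defensively. */
static conn_state step(conn_state state, event_kind event)
{
    if ((int)state < 0 || state >= ST_COUNT || (int)event < 0 || event >= EV_COUNT)
        return ST_CLOSED;
    return transitions[state][event];
}

/* Feed a sequence of events through the machine and return the final state. */
static conn_state run_events(conn_state start, const event_kind *events, size_t n)
{
    conn_state state = start;
    for (size_t i = 0; i < n; i++)
        state = step(state, events[i]);
    return state;
}
```

Adding a new workflow then means adding a state or an event column to the table rather than another special case in the signal handling code.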

Having worked on a code base in C for so long, what amazes me most is how maintainable a piece of software can be if you have a sound understanding of it. I can still make significant architectural changes without fear of rabbit holes. Revisiting Jawas leads me to believe code maintenance is a political and not a technical problem.