Lessons Learned From Jawas

Years ago I took a project that originally was written in C++, and ported it to C w/ embedded Python for scripting. The original Connection Server (CS) was a multiuser chat server that could easily support 300 active game clients on a cheap $500 AMD K7 board.   It eventually learned to speak HTTP/1.0, and so in the rewrite I spent less time on chat and more on serving HTTP/1.1. The rewrite also introduced a version of Spidermonkey as the scripting engine to replace the python. The reasoning here was all of the developers who were working with me knew JavaScript far better than Python, and  by tweaking Spidermonkey I could get adequate performance out of it. It also didn't hurt that I had a background in game server design and understood how I could avoid repeating work. 

Jawas like Node is a C10k style application, which uses mmap, epoll, and kqueue to provide efficient kernel level signaling. Jawas aggressively caches results, and in fact will only calculate a given resources based on a fixed update period. For example, by default Jawas has a period of 200ms, meaning that any request for a given URI will return the same result for 200ms. If a series of POST or UT events update the state of a resource that change would only become apparent at the start of the next 200ms period. This design came out of a desire for consistency where many game clients would typically request a status update every 200ms (which is enough headroom for most network latency) and update accordingly. Conflicts between client requests would typically be resolved out of band, so the server never waited for backend state machines to update.  Lesson here is learn to create a perception of consistency, even though you system state may be inconsistent.  As long as you can resolve conflict within a the period of resolution, there is no conflict from the user's POV. 

Since I had been building systems where thousands of concurrent users would be hitting hundreds of servers several times a second, I got how important maintaining a consistent view could be for performance. Consider the following:

10k concurrent users @ 1 request/200ms = 50k requests per second. 

At 200 users per server, we needed   50 just to handle the load (running at a steady 60% of capacity).   This meant that each server needed to be able to process 1000 requests per second. A request would have about 1ms on average to parse, dispatch, process, and return.   Most servers in production today are 10x slower than this. Most static asset servers are on par with these numbers.   The lesson learned here is you can get static asset performance at an acceptable cost by designing pseudo-static data sources.  This is not the same as relying on dynamic caching, but rather you can gain the benefits of caching by designing data that has a known period of updating 

So since we don't have a lot of spare cycles, we need to not only design such that all backend processing is cheap on DB access, etc.  We need to design such that requests produce similar results for a significant percentage of our users.  This is true of any app that relies on a coherent caching strategy to scale. You need to make sure your shared resources are doing work that can be shared, and your individual resources (clients' machines) are doing the personalization.  Jawas was designed to scale this way.  The lesson learned was shared resources are a precious commodity, but your users are cheap as free, push customization to the edge  by design, never serve a unique asset.  

Part of the reason benchmarks often broke on Jawas is that the comparisons never took into account this design objective. Sure Jawas could serve 6000 requests per second for a page that rendered the current time, only the time would have a granularity of 200ms. Sure the comparable Catalyst app would get you higher precision time, but as it would crap out at 40 requests per second, it would have an effective resolution of 25ms.  This might be attractive if an 8x improvement in timer resolution was important, but you'd spend 8x in hardware, power, and rack space to get it.  The lesson here was optimize for what you need, and know your trade offs.   Better to sacrifice resolution than to pay for something you don't need. 

Probably the best example of this is something that really pisses me off about Node and Nginx, process isolation. Jawas on a per process basis was event threaded. The JavaScript environment was initialized at server start and fork + COW provided a way to have my cake and eat it too. Since all of the files on disk were also mmapped, paging hardware replaced any need for future open calls. When a vnode event fired off, the connection manager would remap, and all future children would get the update. The life of a child however was tied to a particular TCP/IP session, and like Apache would isolate that user to a single process. If a code module caused a child to fault, only that user session would be lost. And since we would only run 200 users per box, we'd typically have no more than 202 Jawas processes running. 200 clients, connection manager, and a restart monitor). Having your server capable of auto restarting on error is hugely beneficial when testing and rolling out binary modules. It also helps in production as overall availability matters, and dropping 1 connection is 200x better than dropping 200. In a massive server hiccup, most keep alive connections would not suffer a disconnect. Additionally, when memory leaks were discovered in production, turning on a max requests per child allowed the server to GC on a per user basis. Since the client was tasked with reconnecting, we would see connection migrate seamlessly in under 400ms. Most users and developers never even noticed.   The lesson here is that stability comes at the cost of enforcing isolation. The context switching cost must be paid if you want to scale reliably. Choosing the right hardware us critical to this effort as well. 

Most of the lessons here are applicable to every system that is highly distributed. My original Connection Server used data ase backed serialized C++ objects, and had a in memory distributed object store (like an early precursor to Redis or Memcache).  Jawas servers used even more aggressive memory cycling, and would spend 65-70% of their run time in kernel space. (avg in production was 68% sys time).  When writev is the top thing you spend processing on, you know you've reached you limit to optimize out the problem.   Since the first apps were Flash heavy, all of the presentation we did was client rendered.  When we switched to HTML5, we persisted in the style of shipping data only. This scales beautifully, and performs fantastic especially when your screen refreshes are tied to the browser animation loop.   And that's the key. Performance at scale is not about how fast you can serve an individual request or x000 requests/second. It is about. Creating a perception of consistency with reliable performance at a predictable cost.   And that's a lesson most platform developers have yet to learn.