DAVE'S LIFE ON HOLD

Generic Servers, OTP, and Why More Erlang

Around 1990, back before I had this thing called Internet access, I had a 2400 baud modem and a BBS. My BBS was really not so much for general use as for experimenting with networked computing. I wrote a small programming language for it that ended up looking a lot like Tcl, except that everything was an object (as per Smalltalk) and under the hood it was mostly Forth. It had a separate green thread for every object, and they all did message passing. I would not hear about Erlang for 7 more years.

In 1994, I had access to two university networks and a lab filled with machines. I also discovered Self (then 9? years old) and would be heartbroken a year later when Sun dropped it for that Java crap. It too had objects and message passing and networking. I then spent most of my time programming in Self and Java's bastard child, JavaScript.

Erlang is now 25+ years old, a descendant of Prolog, and has all the same key characteristics. It lets you build applications out of objects (aka OTP gen_servers) that talk to each other through messaging. It does this in a model that works very much like Ajax, without the cruft of manually parsing and crafting each and every stupid request. It also handles callbacks, out-of-band messaging, and error handling consistently and with the same default behavior. Additional critical features include automatic supervisor services which manage availability, off-the-shelf DBMSes, and a large library of high-quality infrastructure code.

Full disclosure: over the past decade I have written production servers in C, C++, Java, Perl, Python, Ruby, Lua, JavaScript, Forth, OCaml, Common Lisp, Smalltalk, and Erlang. One common problem in all of these languages, which does not exist in Erlang, is updating features at run time without crashing the running system. In C/C++, I wrote a module system that dynamically loaded DLLs, each of which had to be constructed meticulously via macros to avoid symbol collisions and ABI changes. In the more dynamic languages, it requires hiding everything behind proxy objects and loading new versions in separate instances of the VM. So rather than build a full app in the language, a separate C/C++ runtime managed the child language contexts. In OCaml, it ended up being roughly the equivalent of implementing Erlang's gen_server, and the messaging, and the supervisor, at which point I had basically implemented Erlang in OCaml.

So every time I begin a new project, I usually ask myself which language I feel like writing in. Then, after I finish deciding not to write in OCaml after all, I am left with JS and Erlang. From a purely practical standpoint, this is driven by not having to write all the nonfunctional requirements for the umpteenth time. OTP solves those problems sufficiently well, and I can move on to do fun things.

This past Friday, after writing up the Varnish post, I decided to write a new service in Erlang to address the shortcomings of Varnish in a peculiar use case I am currently fighting with. After 4 hours of hacking on some templates generated by rebar, I had a basic solution working. I am going to spend about another 8 hours before I release the first public version. But in the end it will be about a week of hacking for a couple of hours at night to build a production-quality, single-purpose server.

Part of the reason that it takes so little time is that the application can be composed of multiple servers which all communicate with each other in such a way that any single component can crash without taking down the application. For example, a device which checks a candidate address against CIDR notation blocks for whitelisting can be written such that a failure to match simply errors out. The caller can then simply handle the error message rather than have a special no-result handler case. Typically, you can implement round-robin or random dispatch retry logic, and when you finally error out, your own process crashes. Your caller then gets an error response, and can decide to pass it back to the client or bail.
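The retry-then-bail pattern described above might look something like the following. This is only a rough sketch, written in Python rather than Erlang purely for illustration; the names check_whitelist, NoMatch, and dispatch are hypothetical and are not part of OTP or of the service mentioned above.

```python
import ipaddress
import random


class NoMatch(Exception):
    """Raised when a candidate address is in none of the whitelisted blocks."""


def check_whitelist(candidate, cidrs):
    """Return the first CIDR block containing candidate, or raise NoMatch.

    There is deliberately no special "no result" return value: a failure
    to match simply errors out and the caller deals with the error.
    """
    addr = ipaddress.ip_address(candidate)
    for cidr in cidrs:
        if addr in ipaddress.ip_network(cidr):
            return cidr
    raise NoMatch(candidate)


def dispatch(workers, request, attempts=3):
    """Random-dispatch retry: try up to `attempts` randomly chosen workers.

    If every attempt fails, re-raise the last error so that the caller
    sees the failure and can pass it on or bail.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for _ in range(attempts):
        worker = random.choice(workers)
        try:
            return worker(request)
        except Exception as err:  # any worker failure counts as a failed attempt
            last_error = err
    raise last_error


if __name__ == "__main__":
    allowed = ["192.168.0.0/16", "10.0.0.0/8"]
    workers = [lambda req: check_whitelist(req, allowed)]
    print(dispatch(workers, "10.1.2.3"))  # prints 10.0.0.0/8
```

In Erlang the workers would be separate processes and the final failure would be an exit or an error reply rather than a Python exception, but the shape is the same: code the happy path, retry cheaply, and let the last error travel up to whoever called you.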

This approach to reliability, where you just bail, seems counterintuitive. How can just bailing be better than handling errors? The answer lies in parallelism. The system as a whole can continue to operate even when one component, one request, or one node happens to fail. As you scale an application up, the mean time between failures of each component usually remains constant. As a result, the gross quantity of failures is only going to increase. Say you have a process that fails 1 in 10,000 requests (4 9s reliability) and you process 1,000,000 requests an hour. Your system can expect 100 failures an hour, or almost 2 every minute! That means failure must be considered pretty normal.
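The arithmetic behind those numbers can be checked in a few lines. The snippet is in Python just for convenience, and the variable names are made up for this example.

```python
# Failure arithmetic for a process that fails 1 in 10,000 requests
# while handling 1,000,000 requests an hour.
requests_per_failure = 10_000   # one failure per this many requests (4 9s)
requests_per_hour = 1_000_000

failures_per_hour = requests_per_hour // requests_per_failure
failures_per_minute = failures_per_hour / 60

print(f"{failures_per_hour} failures per hour")           # 100 failures per hour
print(f"{failures_per_minute:.2f} failures per minute")   # 1.67 failures per minute
```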

Now if you build your software to fail soon and fail often, you can take advantage of this property! You can code for the happy path, and let your successes process as quickly as possible. Your failures, on the other hand, also become inexpensive, so retrying a request also becomes cheap. Since your services are decoupled and automatically restarted by the supervisor, any given process can crash and only affect that one instance. If the error isn't readily recoverable, your code will eventually bubble the error up to the user, where it can be handled responsibly.

Ultimately, by not trying to account for and guard against failure, your solutions become simpler. By assuming failure is normal, they also become more robust. And because Erlang gives you the frameworks for writing solutions this way as the preferred manner, it is a far superior tool for building fault-tolerant systems.