DAVE'S LIFE ON HOLD

When things go wrong, intentionally

One of the mistakes software engineers make on a regular basis is assuming that the happy path is the correct path. For years C programmers have written code like:

   if (error = this_function_returns_zero_on_success()) {
         fprintf(stderr, "Error Code: %d\n", error);
         exit(error);
   }

This pattern of testing for an error and exiting early is often seen as dangerous: it requires constant vigilance at every call site, and it does not guarantee a satisfying user experience.

To make proper use of this model, a program needs to fork() and handle all behavior that could fail in a child process, catching and retrying upon receiving a SIGCHLD signal. In the parent monitoring process, a main program loop processes events, forking off workers to handle each event. The event monitoring process itself is also monitored by a supervisor process that simply fork()s and then waits for a SIGCHLD from the monitor process, restarting it as needed. This supervisor/monitor/worker pattern requires considerable attention to synchronization and interprocess communication. Because of this, few programs exhibit these properties, though most mission-critical ones do.
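
A minimal single-machine sketch of the monitor/worker half of this pattern, assuming POSIX fork() and waitpid(); handle_one_event() is a hypothetical stand-in for the fallible work, and the simulated crash is invented purely for illustration:

    /* Supervisor sketch: fork a worker per event and restart when one dies. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/wait.h>
    #include <unistd.h>

    static void handle_one_event(void) {
        srand((unsigned)getpid());                /* hypothetical fallible work */
        if (rand() % 4 == 0) exit(EXIT_FAILURE);  /* simulate a crash */
        exit(EXIT_SUCCESS);
    }

    int main(void) {
        for (;;) {
            pid_t pid = fork();
            if (pid < 0) { perror("fork"); return EXIT_FAILURE; }
            if (pid == 0) handle_one_event();     /* child: do the risky work */

            int status = 0;
            waitpid(pid, &status, 0);             /* parent: wait for the child to exit */
            if (WIFEXITED(status) && WEXITSTATUS(status) == 0)
                fprintf(stderr, "worker %d finished ok\n", (int)pid);
            else
                fprintf(stderr, "worker %d died, restarting\n", (int)pid);
            sleep(1);                             /* back off before the next worker */
        }
    }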

Some clever programmers introduced a technique advanced C programmers had used reliably for years, non-local exit, to provide a more reliable method of handling errors in a single-process model. Non-local exit uses a jump buffer, which records the context of the stack frame to which to return, and a long jump function that discards the current context and jumps back to that frame. This might sound complex, but most programmers use it in languages like C++ and Java in the form of try/catch exception handling. Each try block sets the equivalent of a jump buffer, and an uncaught error propagates through the chain of such buffers until it non-locally exits to the top-level runtime, which then terminates the program with a fatal exception.
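
Here is a minimal sketch of that mechanism using the standard library's setjmp() and longjmp(); do_risky_work() and its error code 42 are invented purely for illustration:

    /* Non-local exit: setjmp() records the return point, longjmp() unwinds to it. */
    #include <setjmp.h>
    #include <stdio.h>
    #include <stdlib.h>

    static jmp_buf on_error;                /* the jump buffer set at the "try" site */

    static void do_risky_work(int depth) {
        if (depth == 0)
            longjmp(on_error, 42);          /* "throw": skip every pending frame */
        do_risky_work(depth - 1);           /* the nested calls never see the error */
    }

    int main(void) {
        int code = setjmp(on_error);        /* returns 0 now, the error code on longjmp */
        if (code != 0) {
            fprintf(stderr, "Error Code: %d\n", code);
            exit(code);
        }
        do_risky_work(5);
        puts("happy path");
        return 0;
    }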

This model of programming adds the ability to NOT CATCH errors at various levels, allowing more subroutines to avoid handling errors at all. If you are building a complex application and a reusable module has poorly defined error handling behaviors, deferring in this manner makes sense, as it allows a higher context to identify recovery modes. It does not, however, provide the sort of robustness found in the fork() & fail() model. Its goal is not robustness but reuse. Exception handling is a crude form of intraprocess communication which allows modules to talk to their enclosing context. On rare occasions clever programmers make explicit use of exceptions to communicate non-error conditions as well, or to simply short-circuit a deeply nested context, as in a recursive function. This form of clever programming is generally frowned upon, as minor changes to the exception handling at different context layers produce new code paths at each lower context. This automagical "come from" behavior, added with each catch statement, transforms the state machine into a spaghetti graph of call/return sites.
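
The frowned-upon non-error use looks something like the sketch below: a recursive search that long-jumps out of every pending frame the moment it finds a match. The names found_buf, found_index, and search() are hypothetical, invented for illustration.

    /* Short-circuiting deep recursion with longjmp() instead of unwinding by hand. */
    #include <setjmp.h>
    #include <stdio.h>

    static jmp_buf found_buf;
    static int found_index;

    static void search(const int *xs, int lo, int hi, int target) {
        if (lo >= hi) return;
        int mid = (lo + hi) / 2;
        if (xs[mid] == target) {
            found_index = mid;
            longjmp(found_buf, 1);          /* bail out of every pending frame */
        }
        search(xs, lo, mid, target);        /* visit the left half ... */
        search(xs, mid + 1, hi, target);    /* ... then the right half */
    }

    int main(void) {
        int xs[] = { 7, 3, 9, 4, 1, 8, 2 };
        if (setjmp(found_buf) == 0) {
            search(xs, 0, 7, 8);
            puts("not found");
        } else {
            printf("found at index %d\n", found_index);
        }
        return 0;
    }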

When we move on to building distributed systems, these issues get compounded by additional layers of messaging, synchronization, and conflict resolution. In a distributed system, throwing an exception and catching it may not make sense, as the node capable of recovery may not share any state with the failing node. Additionally, passing a message on error is not a viable strategy, as the failure to send or receive messages may be the exact error in question. As such, a distributed system requires a model wherein workers do work and send messages back to their managers. The managers require a means to time out tasks and resolve conflicts that result from competing workers. Finally, supervisory nodes need to ensure that managers and workers, as well as other supervisors, remain available, restarting them as necessary.
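
A genuinely distributed version needs a network and a message bus, but the manager/worker shape can be sketched on one machine: the worker reports its result back over a pipe, and the manager enforces a deadline and retries. The 2-second deadline, the retry limit, and the slow_answer() helper are all invented for illustration.

    /* Manager/worker sketch: the worker reports back over a pipe, the manager
     * enforces a deadline with select() and retries by respawning the worker. */
    #include <signal.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/select.h>
    #include <sys/wait.h>
    #include <unistd.h>

    static int slow_answer(void) {          /* fake work of unpredictable duration */
        srand((unsigned)getpid());
        sleep(rand() % 4);
        return 42;
    }

    int main(void) {
        for (int attempt = 1; attempt <= 5; attempt++) {
            int fds[2];
            if (pipe(fds) != 0) { perror("pipe"); return 1; }

            pid_t pid = fork();
            if (pid < 0) { perror("fork"); return 1; }
            if (pid == 0) {                 /* worker: compute, report, exit */
                close(fds[0]);
                int answer = slow_answer();
                write(fds[1], &answer, sizeof answer);
                _exit(0);
            }
            close(fds[1]);

            fd_set readable;
            FD_ZERO(&readable);
            FD_SET(fds[0], &readable);
            struct timeval deadline = { 2, 0 };   /* two-second task timeout */

            if (select(fds[0] + 1, &readable, NULL, NULL, &deadline) == 1) {
                int answer = 0;
                read(fds[0], &answer, sizeof answer);
                printf("attempt %d: worker answered %d\n", attempt, answer);
                close(fds[0]);
                waitpid(pid, NULL, 0);
                return 0;
            }

            fprintf(stderr, "attempt %d: timed out, retrying\n", attempt);
            kill(pid, SIGKILL);             /* give up on this worker */
            waitpid(pid, NULL, 0);
            close(fds[0]);
        }
        return 1;
    }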

This pattern is the basis behind Erlang OTP's supervisor trees, the generic server model, and the finite state machine module. By breaking the problem down into a collection of actors, each of which maintains a small piece of a distributed state machine, each Erlang process is free to fail fast and fail often. The basis of most error handling in Erlang is to log the state when the error occurred and die. If your process was linked to another, then the linked process will receive a signal indicating that your process died. If a process doesn't want to trap a given error, it can discard the signal or unlink itself. Since signaling occurs even across nodes, a system gains robustness through physical separation of supervisors and workers. If a worker dies because of a hardware or system failure, the remote supervisor can easily respawn the offending worker at a new location. Failure in a distributed system is a normal mode of operation, and should be embraced as such.

Consider a system with 1000 nodes, each of which services 1000 events per second. In aggregate those nodes process 1,000,000 events every second, 60,000,000 every minute, 3,600,000,000 every hour. With a seven-nines SLA (0.9999999), we can afford a failure every 10 seconds, 6 per minute, 360 per hour. Most systems in the wild never reach this scale or level of service. A single second of downtime once every 3 hours on a single machine is enough to consume this failure budget. Mind you, with a 3-year mean time between failures across 1000 nodes, you can expect to replace 333 nodes per year, or roughly one per day. Transitioning a set of processes off a failing node each day, in 8 seconds, still lets you meet a seven-nines SLA. For this to happen, you basically need a hot standby running which can assume the role of any node within 8 seconds. Daily maintenance means racking and repairing or replacing a node per day.
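
Worked out as a quick back-of-the-envelope program, using the same assumed numbers as above (1000 nodes, 1000 events per second each, seven nines):

    /* Back-of-the-envelope check of the failure budget described above. */
    #include <stdio.h>

    int main(void) {
        const double nodes = 1000.0;
        const double events_per_node_per_sec = 1000.0;
        const double events_per_sec = nodes * events_per_node_per_sec;  /* 1,000,000 */
        const double failure_budget = 1.0 - 0.9999999;                  /* 1e-7 */

        double allowed_per_sec = events_per_sec * failure_budget;       /* 0.1, one every 10 s */
        double allowed_per_day = allowed_per_sec * 86400.0;             /* 8640 failed events/day */
        double lost_in_8s_failover = 8.0 * events_per_node_per_sec;     /* 8000 events */

        printf("allowed failures: one every %.0f seconds, %.0f per hour\n",
               1.0 / allowed_per_sec, allowed_per_sec * 3600.0);
        printf("daily budget %.0f vs. 8 s single-node failover cost %.0f\n",
               allowed_per_day, lost_in_8s_failover);
        return 0;
    }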

This still assumes your software is perfect and is never a source of failure in itself, that all resources are constantly available, and that no conflict ever occurs. In the real world none of these things are true. If we reduce our operational capacity to 60%, and leave ourselves 40% of capacity to flex our resource utilization, we can hedge against resource starvation. It can still occur, especially when buggy code produces an internal DDoS attack, but it becomes less likely under normal operating conditions. This approach effectively trades resources for failure rate, and can be a cost-effective method of reducing risk. Similarly, we can expend additional computing resources to make operations themselves redundant, and rectify conflicts with retry logic. This approach is common in mission-critical systems where a single failure can be catastrophic. Once again we reduce our operational capacity (say down to 20% at peak), in exchange for running our 60% in triplicate. This is often the case with numerical processing of critical financial or telemetry data, where the frequency of CPU and RAM errors manifests itself in corrupted calculations. In these cases the best option is to fail fast and retry the calculation, hoping that a momentary electrical glitch due to power fluctuations was the root cause.
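
A toy sketch of the run-it-in-triplicate idea: perform the same calculation three times, accept the majority answer, and retry the whole round when all three copies disagree. flaky_compute(), the simulated bit flip, and the retry limit are invented for illustration.

    /* Triple-redundancy sketch: majority vote over three copies, retry on disagreement. */
    #include <stdio.h>
    #include <stdlib.h>

    static long flaky_compute(long x) {
        long answer = x * x;                      /* the "real" calculation */
        if (rand() % 100 == 0) answer ^= 1 << 7;  /* simulate a rare corrupted result */
        return answer;
    }

    static int compute_with_redundancy(long x, long *out) {
        for (int attempt = 0; attempt < 3; attempt++) {
            long a = flaky_compute(x), b = flaky_compute(x), c = flaky_compute(x);
            if (a == b || a == c) { *out = a; return 0; }   /* majority agrees with a */
            if (b == c)           { *out = b; return 0; }   /* majority agrees with b */
            fprintf(stderr, "attempt %d: all three copies disagree, retrying\n", attempt);
        }
        return -1;                                /* fail fast after repeated disagreement */
    }

    int main(void) {
        srand(12345);
        long result = 0;
        for (long x = 0; x < 1000; x++)
            if (compute_with_redundancy(x, &result) != 0) return EXIT_FAILURE;
        printf("last result: %ld\n", result);     /* 999 * 999 = 998001 */
        return 0;
    }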

Each of these approaches to mitigating failure has a special property which becomes apparent at scale: failure is ok, it's normal, and you had better be comfortable with it.