Self Healing Systems

Self healing systems design has been a topic of professional interest of mine for over a decade. It is the point at which actual systems architecture and engineering begin and amateurism ends.  If you find yourself staring at block diagrams involving thousands of servers, millions of clients, and billions of connections, you start to understand how biology works.  The thing you are dealing with is more like a living, breathing organism than a mere aggregation of solid blocks of stone.

Traditionally these sorts of systems revolve around two concepts: availability and reliability. Availability is the capacity to accept new input and generate new output. A system is available if it responds; it may not give you a response you like, but it responds.  Reliability is the capacity to provide assurances that a system does what you told it to do.  That is to say, there is a sort of contractual obligation on the part of the system to ensure that what it said it did was actually done.  A system may be reliable but not available, usually due to some sort of blocking operation.

A self healing system is a reliable system with high availability that automatically detects and repairs damaged or underutilized resources.  Like a living organism, it recycles and clones individual cells in its body to maintain the integrity of the system as a whole.   But key to this design is that no single component of the system is unique or performs a unique function, though components may be given unique names.

Personally, my favorite model for building self healing systems comes from the stem cell. The concept is that each cell in the system has the inherent capability to discover its own identity and differentiate itself accordingly.  In the DNA of each cell lie the instructions for performing all of the functions of any type of cell, along with the inherent capability to communicate with other cells to automatically coordinate function.

In a software systems setting this means 3 fundamental things:

1.) all code goes everywhere: any server can perform any function
2.) identity and configuration are discovered via DHCP and DNS
3.) behavior is controlled via a message queue with routing

This baseline lets us establish some ground rules for talking about how we build systems.  All code goes everywhere means any hardware can serve any function with the correct configuration.  Using DNS for configuration allows us to use DHCP for assigning identity, and DNS TXT records provide a scalable method for distributing additional configuration information.  As a DNS TXT record can easily store key=value pairs, it is possible to write applications that rely upon a set of environment variables discovered via DNS and keyed to the server's hostname.
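As a minimal sketch of the TXT-record idea, assuming each record holds space-separated key=value pairs (the `parse_txt_config` name is hypothetical, and the actual DNS lookup is elided):

```python
def parse_txt_config(txt_records):
    """Parse DNS TXT record strings of the form "KEY=value KEY=value"
    into a dict of environment variables for the local service."""
    env = {}
    for record in txt_records:
        for pair in record.split():
            if "=" in pair:
                key, _, value = pair.partition("=")
                env[key] = value
    return env

# Records that might be published under a host's own name, e.g.
# web01.example.internal. IN TXT "DB_HOST=db01 DB_PORT=5432"
records = ["DB_HOST=db01 DB_PORT=5432", "ROLE=web CACHE=redis01"]
config = parse_txt_config(records)
```

The application would resolve TXT records for its own hostname at startup and treat the result as its environment.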

The final bit is using a message queue with routing to control the behavior of each server and service.   By running a management daemon on each node that listens for control commands, it becomes possible for services to notify each other of changes in their environment.  For example, a web service may notify a proxy that it has come online, making itself available to the pool.  Once in play, the web server may find itself experiencing too much load and request that the proxy lower its weight to reduce the number of requests.  Similarly, a management daemon may notify a proxy that an additional shard of the database has come online, and request an update to the partitioning logic to add it to the ring.  This same request may trigger a rebalancing as new hashes are computed across the cluster and data is moved accordingly.

By building systems that communicate, we allow operations to focus on the care and feeding of the beast by replacing hardware or adding capacity.   Rather than struggling to envision how to solve problems, operators can monitor the communication channels for data about the health of the system and plan capacity accordingly.  The fact that two parts of the system may communicate is not enough; they must also communicate consistently. To this end, the application layer protocols should standardize on a common format to enable interservice communication. JSON is a particularly useful encoding scheme, but more important is the concept of using symbolic expressions to program the cluster.
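To make the common-format point concrete, here is a toy sketch of a JSON message envelope; the field names and helper functions are illustrative assumptions, not a prescribed wire format:

```python
import json

def make_message(sender, msg_type, payload):
    """Wrap a payload in a common envelope so any service can parse
    (or choose to ignore) messages from any other service."""
    return json.dumps({"from": sender, "type": msg_type, "payload": payload})

def read_message(raw):
    """Decode an envelope back into its parts."""
    msg = json.loads(raw)
    return msg["from"], msg["type"], msg["payload"]

# A web server announcing itself to a proxy's pool:
raw = make_message("web01", "join_pool", {"weight": 10})
```

Because every service speaks the same envelope, a recipient can route on the `type` field without knowing anything else about the sender.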

Being able to read requests and send responses is nice, but more important is the ability to make requests and handle the responses.   If you view all communication as an asynchronous affair, you have two basic behaviors:

  Message ->
   -> Message

A service which generates messages may expect nothing in return, such as a keep alive service which just keeps a persistent connection open.  Another service may be a recipient only; a sump for log data is a typical case.  But most services will involve some level of push and pull. It is important that all messages sent between services conform to a model in which the act of sending a message is a one way affair.  Fire and forget: if someone is listening, it is their responsibility to tell you.  Equally important on the receiver's side is the ability to ignore unwanted messages.  A recipient should never error out simply because it was the unfortunate recipient of an errant message.  Mapping incoming messages to state machine transitions is important, with most unwarranted messages translating to a no-op.  In a program you may want an option to enter a diagnostic mode which reflects errant messages to a dead letter office, but this is rarely useful in a production system.
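The message-to-transition mapping might be sketched like this; the states and message names are made up for illustration, and the key point is that anything not in the transition table is a silent no-op:

```python
class Service:
    """Toy state machine: incoming messages map to transitions, and any
    message that does not apply in the current state is a no-op rather
    than an error."""

    TRANSITIONS = {
        ("offline", "start"): "online",
        ("online", "drain"): "draining",
        ("draining", "stop"): "offline",
    }

    def __init__(self):
        self.state = "offline"

    def handle(self, message):
        # Unwarranted messages fall through to the current state (no-op).
        self.state = self.TRANSITIONS.get((self.state, message), self.state)

svc = Service()
svc.handle("start")   # offline -> online
svc.handle("start")   # duplicate delivery: no-op, still online
svc.handle("noise")   # errant message: ignored, never an error
```

A duplicate or errant message simply leaves the machine where it is, which is exactly the graceful behavior the fire-and-forget model requires.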

These rules of message management are critical to the operation of a self healing system.  As any given recipient may or may not exist when the message is sent, waiting for a reply will result in a cascade failure.  As more services enter into a wait state, the availability of the system at large suffers.  The key design principle here is: leave a message, and if they need to, they'll call you back.  Similarly, as a recipient, you need to be graceful in how you handle messages. If you model a state machine in which a progression of states is desired, the best approach is to ignore messages for those states to which the system can't transition.  This allows senders to send the same message more than once.  Part of the reason a message may be delivered more than once is that a forwarding agent may experience a failure immediately after sending and not preserve the sent status.  The result is that when it recovers it may resend the message.  This improves the reliability of the system, as repetitious sending of a message does not trigger an error state.  By the same token, error recovery can ensure delivery at least once.

Beyond this, the design of the system can take these same principles and apply them throughout the application stack.   Storage, messaging, processing, remote connections, load balancing, and searching should all embody the same precept: interchangeable parts can be added or removed at any time.  To realize this in a production environment requires that all data be stored redundantly and that locality be maintained by shifting processing to where the data is.   If you look at Amazon Dynamo inspired data stores, you'll see both these properties at work. Riak in particular is a perfect example.   In Riak, each node communicates with the others via messaging, and can ensure the availability of data by copying partitions around the ring. In this design the performance of the system improves as the number of nodes increases, as the partitioning more evenly distributes data across more shards.  This means that a client application will automatically distribute I/O and processing across multiple machines without having to exert any effort.  Riak self heals by having each node communicate with the others and rebalance the ring in the event of the loss of a node.
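The ring idea can be sketched in a few lines. This is not Riak's actual vnode implementation, just a toy illustration of hashing a key onto a ring of nodes and taking the next few nodes clockwise as the replica set:

```python
import hashlib

def ring_position(key, slots):
    """Hash a key onto one of `slots` positions on the ring."""
    digest = hashlib.sha1(key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % slots

def owners(key, nodes, replicas=3):
    """Replica set for a key: the node at the key's ring position plus
    the next replicas-1 nodes clockwise around the ring."""
    start = ring_position(key, len(nodes))
    return [nodes[(start + i) % len(nodes)] for i in range(replicas)]

nodes = ["node1", "node2", "node3", "node4", "node5"]
```

Any node can compute the same replica set for any key, which is what lets the cluster route and rebalance without central coordination.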

The same principles can apply to database design.  Take a multi-master configuration of PostgreSQL with Skytools as an example.  We might configure our master databases with a guid() function which starts at a known offset:

A = 1 << 63
B = 1 << 62
C = 1 << 61
D = 1 << 60
E = 1 << 59

We may also in this guid() function reserve the top 16 bits for our database ID, meaning we could have up to 64k shards, though in our current scheme we're just using 5.  We would then mask out the lower 48 bits as the local autoincrement value that produces our monotonic function.  This means each guid gives us a location of origin as well as a relative time stamp.  Each ID range is large enough to saturate the memory of any system you throw at it, and unlikely to run out over time.  Since each write master is localized, we can also safely conjoin the tables without fear of collision. On the read slaves we can set up replication to map each table to $(hostname)_$(tablename), so if we have a customers table on A we would have an A_customers table on each read slave that holds the contents of A's customers table. Using table inheritance on the read slave we can create a public.customers table that inherits from A_customers, B_customers, etc., with the net result being a customers table that spans all 5 write masters.  If we set up a round robin proxy like pgbouncer, we can simply have our insert, update, and delete queries hit a write master and our selects route to our read cluster.  But there are a few caveats to this scheme:
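The guid() scheme above can be sketched as follows; in practice it would live in the database as a function backed by a sequence, so the in-memory counter here is a stand-in:

```python
# Per-database offsets from the scheme above: origin in the high bits.
OFFSETS = {"A": 1 << 63, "B": 1 << 62, "C": 1 << 61, "D": 1 << 60, "E": 1 << 59}
MASK48 = (1 << 48) - 1  # lower 48 bits: the local autoincrement

counters = {name: 0 for name in OFFSETS}  # stand-in for a DB sequence

def guid(db):
    """Next guid for write master `db`: fixed offset in the high bits
    plus a monotonic 48-bit local counter."""
    counters[db] += 1
    return OFFSETS[db] | (counters[db] & MASK48)

def origin(g):
    """Recover the database of origin from a guid's high bits."""
    for name, offset in OFFSETS.items():
        if g & offset:
            return name

first = guid("A")
second = guid("A")
```

Because the offsets occupy disjoint high bits, ids from different masters can never collide, and within one master they remain monotonic.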

1.) joins in writes may not work as intended
2.) updates are generally a bad idea as they require consistent routing
3.) deletes are generally a bad idea as they require consistent routing

The reason for these issues is that each partition only has 1/5th of the data. To make things work, you need to use the locality bits to consistently write data to the same node. This means that each partition cannot have data referencing a foreign key in another partition.   But there is a solution to all of these problems: append only databases.

An append only database has only two operations: insert and select. Rather than structuring your data as specific records, you organize them as families of records related by a common criterion over time.  You can think of it as keeping a history of an object and selecting the most relevant version for any point in time. This gives you three interesting properties:

1.) data structures are immutable
2.) common data can be shared
3.) they are hashable
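A minimal sketch of the insert-and-select-only model, with hypothetical names; records are never updated in place, and a select simply returns the newest member of a family:

```python
class AppendOnlyStore:
    """Toy append-only store: insert and select are the only operations.
    Records in the same family supersede each other by recency."""

    def __init__(self):
        self.log = []   # the append-only record log
        self.seq = 0    # monotonic version counter

    def insert(self, family, value):
        self.seq += 1
        self.log.append((family, self.seq, value))

    def select(self, family):
        # Newest record in the family wins; old versions remain intact.
        for fam, _, value in reversed(self.log):
            if fam == family:
                return value
        return None

db = AppendOnlyStore()
db.insert("customer:1", {"name": "Ada"})
db.insert("customer:1", {"name": "Ada Lovelace"})  # supersedes, no update
```

Note that the first record is still in the log after the second insert; history is preserved, and "update" is just a more recent insert.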

Immutability means that a record cannot be changed once created. To update a record, one creates a new record which supersedes the old by virtue of being more recent. B-tree and trie data structures are often used to represent these sorts of data models, as the underlying nodes are effectively atomic while the aggregation is a tree of values.  The fact that a subtree is immutable means it can be safely shared among many objects that have the same property set. This means you need only a pointer to a record family that is common across many objects.  This adds to the searchability of the data set, as each subtree can be hashed to generate a Merkle tree.  In this way the differences between any two objects can be reduced to comparing the hash of their hashes.
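The hash-of-hashes idea might be sketched like this, using nested dicts as stand-ins for immutable subtrees:

```python
import hashlib

def h(data):
    return hashlib.sha256(data).hexdigest()

def merkle(node):
    """Hash a tree bottom-up: a leaf hashes its value, and an inner
    node hashes the concatenation of its children's hashes (the hash
    of their hashes)."""
    if isinstance(node, dict):
        child_hashes = "".join(merkle(node[key]) for key in sorted(node))
        return h(child_hashes.encode())
    return h(repr(node).encode())

a = {"name": "Ada", "addr": {"city": "London"}}
b = {"name": "Ada", "addr": {"city": "London"}}
c = {"name": "Ada", "addr": {"city": "Paris"}}
```

Two objects with the same subtree produce the same subtree hash, so comparing any two objects reduces to comparing root hashes and descending only where they differ.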

From an operational perspective you are trading space for ease of replication.  You can't have conflict issues, as each entity is unique and immutable. This means synchronization is easy if you replicate each partition in order.  Integrity is on a per object level, so all foreign constraints are also strictly backwards facing.  In this respect it mimics a very early database system, the Forth dictionary.  Sometimes described as a hyper-static context, you can think of a Forth dictionary as an append only list of definitions which is always searched in reverse order (tail first).  Forth terms may only reference words which were defined previously, and if a term is redefined, all definitions referencing the prior definition will continue to reference that definition.  This model allows for updating a logical entity while the physical entities remain unchanged.

CouchDB also uses this model of an append only database, with incremental index file deltas appended at the end of each commit.  This allows Couch to safely replicate a database to dozens of nodes via HTTP and eventually reach consistency. It also allows it to merge data from multiple databases, as each document receives a UUID, which makes it practically safe to move any document to any other database.  The consistency obtained is not necessarily binary identical databases, but a logically consistent document store with equivalent indexes.

The major downside of this design is that long lived databases in CouchDB, like all append only databases, require a garbage collection phase known as compaction.  In the compaction phase, you copy the contents of the entire database to a new node, copying only documents referenced by the most current index. A delta import is then done to add the newest documents as they come in.  Depending on the size of your data model, the number of documents, archival needs, and usage patterns, you may need to compact frequently or almost never.
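Compaction can be sketched generically. Assuming an append-only log of (family, sequence, value) tuples, the whole phase amounts to keeping only the most recent record of each family:

```python
def compact(log):
    """Copy only the most recent record of each family to a fresh log,
    dropping superseded versions (the garbage of an append-only store)."""
    latest = {}
    for family, seq, value in log:
        latest[family] = (family, seq, value)   # later records win
    return sorted(latest.values(), key=lambda rec: rec[1])

old_log = [("doc:1", 1, "v1"), ("doc:2", 2, "v1"), ("doc:1", 3, "v2")]
new_log = compact(old_log)
```

A delta import of records appended during the copy then brings the new log fully up to date, exactly as described above.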

In one append only distributed datastore solution I built for an online boxing game ca. 2004, our C++ solution stored every game event ever played on a pair of $1000 commodity servers, and provided all user registration, league, and game data for all of beta.  We safely projected it would take a full year of production before we even had to consider adding RAM.   We could add up to 64k servers to that one object space, each with 48 bits of address space (forget IDs, we used physical addresses in 64k mmapped files).  Even with an append only model, and never recycling, we could support every person on the planet playing our game 24/7/365 for a decade before a redesign would be necessary.  We would also all have been multi-billionaires had that problem materialized, so we felt confident we could hire someone else to solve it if it did.   The combination of partitioning scheme, consistent routing among a ring of redundant servers for each partition, and append only data structures enabled our games to scale simply by adding a new box to the mix and kick starting it via PXE boot.   If a box failed, yank it, PXE boot a replacement, and when replication finished you were back at strength.   Self healing plus an application suite designed to run within the constraints necessary to make healing quick and painless meant we could provide 24/7 support with 3 guys.   I never had a pager ring once, other than the automated tests of the pager service to verify it was still working.