The Homeostatic Solution
Self-governing solutions are of particular interest to me these days. The question is how one engineers a solution that produces stable, steady-state behavior under normal operating conditions, but is capable of responding to changing conditions by reaching a new normal automatically. That's really the holy grail of system design when you think of it. An adaptable and scalable system that reaches a steady state regardless of initial conditions requires both a capacity for rapid growth and graceful decay. The system must be capable of suffering multiple organ failures and still keep on ticking. It must also be able to recover from catastrophic load automatically. It doesn't need to provide 100% uptime or even perfect (error-free) service, but it must be resilient.
So what does it take to build such a system?
- interchangeable parts - replace parts and repair offline
- battlefield promotions - any node should be able to assume new responsibilities
- self-awareness - it should gather and analyze metrics about itself to automatically adapt
- communication - components should message each other in a loosely coupled manner
- duplicative but not redundant - no single point of failure, but only as much duplication as necessary
What these things mean in practical terms is that there is a minimal size to any system that can be built, as well as a maximal size for any given set of demands. It also means that each individual node must have the ability to become any other node (though not necessarily after differentiation occurs). For example, a component may begin life as a web server, get promoted to a monitoring node for a subgraph of the cluster, and finally assume command and control responsibility for building new web servers in its part of the network based on load. These properties of the system are not intrinsic to the node itself, but rather must be able to emerge organically through use.
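To make the promotion idea concrete, here is a minimal sketch in Python. The names (`Role`, `Node`, `ensure_role`) are hypothetical and only for illustration. It assumes every node starts life as a web server and that any web server can be promoted when a role is missing. It is one possible shape for the idea, not a description of any particular system.

```python
from dataclasses import dataclass, field
from enum import Enum


class Role(Enum):
    WEB = "web"
    MONITOR = "monitor"
    CONTROLLER = "controller"


@dataclass
class Node:
    name: str
    role: Role = Role.WEB  # assumption: every node begins as a web server
    history: list = field(default_factory=list)

    def promote(self, new_role: Role) -> None:
        # Remember the previous role, then take on the new one.
        self.history.append(self.role)
        self.role = new_role


def ensure_role(nodes, role):
    """Return a node holding `role`, promoting a web server if none exists."""
    for node in nodes:
        if node.role is role:
            return node
    for node in nodes:
        if node.role is Role.WEB:
            node.promote(role)
            return node
    return None  # nothing left to promote


cluster = [Node("a"), Node("b"), Node("c")]
monitor = ensure_role(cluster, Role.MONITOR)        # promotes node "a"
controller = ensure_role(cluster, Role.CONTROLLER)  # promotes node "b"
```

The role lives in the node's state rather than in its type, so nothing about a box is special until the cluster needs it to be.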
Messaging is necessary as it is the only manner in which a distributed system can share state while maintaining distinct separation between components. If you weld two wires together, you create a single system. If, however, you just touch two wires together, you maintain physically distinct entities but still allow for the flow of electromagnetic payloads between them. This sort of loose coupling is important if you want to be able to yank out cables and swap out physical parts. Messaging is also necessary because this loose coupling means that any node can lose access to its neighbor at any time. As such, it becomes incumbent upon the system to be aware of which nodes are connected and what routes are available at any time.
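One common way to keep track of which neighbors are still reachable is to have nodes exchange heartbeat messages and treat silence as disconnection. The sketch below assumes that approach. The `Membership` class, its `timeout` parameter, and the injectable `clock` are all hypothetical names chosen for the example.

```python
import time


class Membership:
    """Tracks which neighbors are reachable based on heartbeat messages."""

    def __init__(self, timeout=5.0, clock=time.monotonic):
        self.timeout = timeout  # seconds of silence before a node counts as gone
        self.clock = clock      # injectable so the logic can be tested without sleeping
        self.last_seen = {}

    def heartbeat(self, node_id):
        # Call this whenever a message arrives from node_id.
        self.last_seen[node_id] = self.clock()

    def alive(self):
        # A node is considered connected if it was heard from recently enough.
        now = self.clock()
        return sorted(
            node for node, seen in self.last_seen.items()
            if now - seen <= self.timeout
        )


# Usage with a fake clock so the behavior is deterministic.
fake_now = [0.0]
members = Membership(timeout=5.0, clock=lambda: fake_now[0])
members.heartbeat("a")
fake_now[0] = 3.0
members.heartbeat("b")
fake_now[0] = 6.0
reachable = members.alive()  # "a" has been silent for 6 seconds, so only "b" remains
```

Nothing here requires a node to know why its neighbor went quiet. A pulled cable and a crashed process look the same, which is exactly what you want from loosely coupled parts.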
Monitoring provides this sort of internal awareness. Much like you know when you are hot or cold, hungry or thirsty, so too should your system know whether it is under load. Rather than alerting a sysadmin, the system should be able to reconfigure itself to meet its own needs. These changes need not be terribly clever to be effective. Adding and removing node types depending on job requests, with hard quotas and a priority list, can go a long way. If spinning up a few extra worker nodes for batch processing is cheap enough, no one will even notice this behavior. If code is deployed everywhere, this can be the case. If data is distributed and the code can chase the data, great efficiencies come from locality. Reconfigurability makes it possible to move code to data when usage patterns change or requirements alter load.
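The "hard quotas and a priority list" rule really can be this simple. The following is a rough sketch under assumed inputs: `plan_nodes`, the node type names, and the numbers are all made up for illustration. Each node type gets what its pending jobs ask for, capped by its quota, in priority order, until the cluster runs out of capacity.

```python
def plan_nodes(demand, quotas, priority, capacity):
    """Allocate nodes to node types in priority order.

    demand:   {node_type: nodes requested by pending jobs}
    quotas:   {node_type: hard upper limit for that type}
    priority: list of node types, most important first
    capacity: total number of nodes available
    """
    plan = {}
    remaining = capacity
    for node_type in priority:
        # Never exceed the hard quota, no matter how much is requested.
        wanted = min(demand.get(node_type, 0), quotas.get(node_type, 0))
        # Lower-priority types only get what is left over.
        granted = min(wanted, remaining)
        plan[node_type] = granted
        remaining -= granted
    return plan


plan = plan_nodes(
    demand={"web": 6, "batch": 10},
    quotas={"web": 4, "batch": 8},
    priority=["web", "batch"],
    capacity=10,
)
# plan == {"web": 4, "batch": 6}
```

Run this on every monitoring tick and diff the result against what is currently running. The difference tells the system which nodes to add or retire, with no sysadmin in the loop.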
Having built several such systems over the years, I have found that these requirements challenge a lot of what is standard practice in the industry. The biggest change for systems administrators is giving up control. You don't know exactly what the system is doing at any given time, but with all the monitoring in place you don't need to. The benefits are huge, though: simpler maintenance (just plug in a new box) and fewer 5am phone calls (self-healing).