Many to Many Services

Over the past two decades, I have spent the majority of my waking hours thinking about how to build applications that bring many users together with many different service providers. Take the case of the poker company, where there was not only many humans playing poker at a time, but also many AIs which managed the tables, ran the bank and managed accounts, and also AIs which played poker as well to flesh out tables (also useful for testing load). In that infrastructure everything revolved around ensuring messages reliably made it to all interested parties, and the servers and clients cooperatively managed load. The smart server/smart client model works spectacularly well when reliability is measured in terms of sub-second outages.

Designing systems in which failure is the expected mode of operation, it is important to look at opportunities to glean system information without doing additional work. Using a client side timer to specifically test connectivity isn't a great use of resources, but using a periodic state transfer on a persistent connection with a timeout probably is. When a disconnect is detected, you don't want to rely upon a set progression of fail overs, as that will generate a stampede, but rather can connect to a random endpoint and then discover a gathering point. I used this approach in several game clients to avoid the thundering herd problem when an entire datacenter went down. Combined with a retry backoff and a client with a limited memory, individual server failures would produce an even increase of load across the clusters. Then after a brief spike in activity, consistent hashing on the server side would redistribute the clients evenly across the clusters.

Having a plan for how you respond to failure also involves learning to avoid generating spikes. One of the key statistical tools is randomness. It allows you to flatten load for homogeneous requests. Randomness does not however help you flatten load for heavy requests, and can in fact cause server failure intermittently if it results in some particularly heavy requests to pool at a single endpoint. Part of the logic used in the game's connection layer relied upon the fact that the initial random request would always generate a redirect as it's response. On the serverside, the calculation to determine the location of the redirect could be done entirely in memory and in constant time. These two factors were important for balancing the load if more than a single server hosting 200-2000 concurrent connections web down.

The real smarts of this system came in with setting up the initial load balancing. While clients would hash to a given room, the servers would communicate load via a shared connection, allowing the other servers in the pool to volunteer for the creation of an endpoint. Effectively, every time a client requested a new chat room be created the servers would draw straws weighted by their load, and the shortest straw would host the room. The client would then be redirected to the newly created endpoint, which itself was managed as a separate process on the specified server.

For rooms which could not fit on a single server, a cohort of servers could pool their connections and used a relay proxy bot to shuffle communications between them. By adding the S2S plumbing, geographically disparate shared spaces could be created, with multiple points of entry. The persistent object store itself worked on this principal, where 3 machines would each join a room and evaluate each transaction on the objects in their address space. Since each object memory was append only, the response consisted of whever got a result first. Any server missing the data would copy down the object when it was accessed. This ensured that as long as one pool member had a copy, all would have a copy if the data was relevant. Storage pressure was never addressed in so far that the address space was so huge, it was easier to migrate all of the active objects to new partitions and recycle the old spaces. This technique would have failed when the partition space reached half saturation 32k partitions, but that would imply 96k servers each hosting full images. In otherwords, a problem so far removed from the realistic problem space as to not be worth considering.

And that is probably the key concept in scaling, there exists a point Omega, such that beyond which nothing of value is gained. If your system's omega is outside the realm of realistic possibilities, your system is said to scale. This also works with an alpha value, usually 1, below which no useful work can be done. But that is a post for another day.